# Traveling Salesman Problem and Sports Analytics

## Set-up:

In [1]:
import numpy as np
from numpy import unravel_index
import pandas as pd
from gurobipy import *
import matplotlib.pyplot as plt
import math
import random

## Question 1: Cargo freight scheduling

### Part 1: Building our data

Let's first load the cargo city location data to see what we are working with.

In [2]:
df_cargo = pd.read_csv('cargo-city-locations.csv')
df_cargo.head()

Unnamed: 0,State,City,Latitude,Longitude
0,Alabama,Montgomery,32.377716,-86.300568
1,Alaska,Juneau,58.301598,-134.420212
2,Arizona,Phoenix,33.448143,-112.096962
3,Arkansas,Little Rock,34.746613,-92.288986
4,California,Sacramento,38.576668,-121.493629


First, we will need to convert the latitudes and longitudes from degrees to radians and add these new columns to our dataframe.

In [3]:
# Convert degrees to radians column-wise (equivalent to degrees / 360 * 2 * pi)
df_cargo['Lat_radians'] = np.radians(df_cargo['Latitude'])
df_cargo['Lon_radians'] = np.radians(df_cargo['Longitude'])

df_cargo.head()

Unnamed: 0,State,City,Latitude,Longitude,Lat_radians,Lon_radians
0,Alabama,Montgomery,32.377716,-86.300568,0.565098,-1.506229
1,Alaska,Juneau,58.301598,-134.420212,1.017555,-2.346075
2,Arizona,Phoenix,33.448143,-112.096962,0.58378,-1.956461
3,Arkansas,Little Rock,34.746613,-92.288986,0.606443,-1.610747
4,California,Sacramento,38.576668,-121.493629,0.67329,-2.120464


Next, let's calculate the distance between all cities.

In [4]:
city_distances = np.empty((68, 68))

# First, compute the haversine of the central angle between each pair of cities
# (columns 4 and 5 hold the latitude and longitude in radians)
for i in range(68):
    for j in range(68):
        city_distances[i][j] = (1 - np.cos(df_cargo.iloc[j][4] - df_cargo.iloc[i][4])) / 2 + np.cos(df_cargo.iloc[i][4]) * np.cos(df_cargo.iloc[j][4]) * (1 - np.cos(df_cargo.iloc[j][5] - df_cargo.iloc[i][5])) / 2

# Next, convert each haversine value to a great-circle distance in km
r_earth = 6378.137  # Earth's radius in km
for i in range(68):
    for j in range(68):
        city_distances[i][j] = 2 * r_earth * np.arcsin(math.sqrt(city_distances[i][j]))
        
# Now, convert distances (km) to travel times (hours) at a speed of 908 km/h
travel_times = city_distances / 908
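
The same calculation can be sketched for a single pair of cities using only the standard library. The function names `haversine_km` and `travel_time_hours` are illustrative and not part of the notebook; the sketch relies on the identity (1 - cos θ) / 2 = sin²(θ / 2), so it should agree with the matrix computation above up to rounding.

```python
import math

EARTH_RADIUS_KM = 6378.137  # same value as r_earth above
SPEED_KM_PER_HOUR = 908     # same divisor used to turn distance into time


def haversine_km(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (lat1_deg, lon1_deg, lat2_deg, lon2_deg)
    )
    # hav(theta) = sin^2(dlat / 2) + cos(lat1) * cos(lat2) * sin^2(dlon / 2)
    hav = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to 1.0 in case rounding pushes the value just outside asin's domain
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, hav)))


def travel_time_hours(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Travel time in hours at the constant speed assumed above."""
    return haversine_km(lat1_deg, lon1_deg, lat2_deg, lon2_deg) / SPEED_KM_PER_HOUR


# Montgomery, Alabama to Juneau, Alaska, using the coordinates printed earlier
montgomery_to_juneau = travel_time_hours(
    32.377716, -86.300568, 58.301598, -134.420212
)
```

Wrapping the formula in a function makes it easy to spot-check individual entries of the travel-time matrix.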

The DataFrame below contains the travel times, in hours, between every pair of cities.

In [5]:
location_names = []

for i in range(68):
    full_name = df_cargo.iloc[i][1] + ', ' + df_cargo.iloc[i][0]
    location_names.append(full_name)

df_travel_times = pd.DataFrame(travel_times, columns = location_names, index = location_names)
df_travel_times

Unnamed: 0,"Montgomery, Alabama","Juneau, Alaska","Phoenix, Arizona","Little Rock, Arkansas","Sacramento, California","Denver, Colorado","Hartford, Connecticut","Dover, Delaware","Honolulu, Hawaii","Tallahassee, Florida",...,"San Jose, California","Jacksonville, Florida","Fort Worth, Texas","San Francisco, California","Charlotte, North Carolina","Seattle, Washington","Washington, DC","El Paso, Texas","Detroit, Michigan","Portland, Oregon"
"Montgomery, Alabama",0.000000,5.056977,2.651445,0.677043,3.571961,2.054484,1.756424,1.355123,7.806145,0.318083,...,3.609520,0.547065,1.140090,3.660538,0.656446,3.817057,1.220526,2.095221,1.260576,3.781753
"Juneau, Alaska",5.056977,0.000000,3.556684,4.459943,2.626347,3.234904,5.052450,5.101044,4.987879,5.373534,...,2.758958,5.546746,4.367639,2.692515,5.123105,1.581169,5.025938,3.995209,4.336216,1.797078
"Phoenix, Arizona",2.651445,3.556684,0.000000,2.013975,1.123168,1.040876,3.924564,3.655538,5.154748,2.908468,...,1.087630,3.180316,1.517599,1.156569,3.158433,1.976536,3.511184,0.615248,2.996232,1.781742
"Little Rock, Arkansas",0.677043,4.459943,2.013975,0.000000,2.897753,1.380228,2.068074,1.726577,7.161907,0.980876,...,2.939102,1.224064,0.568866,2.988580,1.150530,3.160302,1.578932,1.499250,1.283099,3.114263
"Sacramento, California",3.571961,2.626347,1.123168,2.897753,0.000000,1.573601,4.527916,4.340942,4.364627,3.861098,...,0.156507,4.117976,2.501850,0.132749,3.973959,1.109531,4.208082,1.717417,3.585342,0.856171
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Seattle, Washington",3.817057,1.581169,1.976536,3.160302,1.109531,1.809405,4.310989,4.229411,4.754172,4.132303,...,1.259481,4.349978,2.943722,1.205329,4.047216,0.000000,4.119428,2.441116,3.430231,0.259208
"Washington, DC",1.220526,5.025938,3.511184,1.578932,4.208082,2.641799,0.536792,0.147702,8.568040,1.268460,...,4.285926,1.149489,2.147768,4.320413,0.584850,4.119428,0.000000,3.057566,0.698925,4.164774
"El Paso, Texas",2.095221,3.995209,0.615248,1.499250,1.717417,0.989286,3.511996,3.204749,5.725878,2.332492,...,1.696044,2.608214,0.956536,1.762804,2.648236,2.441116,3.057566,0.000000,2.621474,2.280987
"Detroit, Michigan",1.260576,4.336216,2.996232,1.283099,3.585342,2.047712,0.945475,0.799627,7.949315,1.463141,...,3.674839,1.477416,1.813918,3.703357,0.895963,3.430231,0.698925,2.621474,0.000000,3.485486


#### a) Two cities with the highest travel time

First, let's find the index with the highest travel time.

In [6]:
result = np.where(travel_times == np.amax(travel_times))
print(result)

(array([ 8, 18], dtype=int64), array([18,  8], dtype=int64))


The highest travel time occurs at index (8, 18), or equivalently (18, 8), since the matrix is symmetric. Let's look up which cities these indices correspond to, along with the travel time.

In [8]:
highest_time = round(df_travel_times.iloc[8][18], 5)
highest_city1 = df_travel_times.iloc[8].name
highest_city2 = df_travel_times.iloc[18].name

print('Highest travel time occurs between', highest_city1, 'and', highest_city2, 'with a travel time of', highest_time, 'hours.')

Highest travel time occurs between Honolulu, Hawaii and Augusta, Maine with a travel time of 9.06815 hours.


#### b) Two cities with the smallest travel time

First, let's find the index with the lowest travel time.

In [9]:
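# Replace the zero self-travel times with a large sentinel so they are not picked as the minimum
df_travel_times.replace(0, 100, inplace=True)
# Search the DataFrame that was just modified rather than the original NumPy array
times_no_self = df_travel_times.to_numpy()
result2 = np.where(times_no_self == np.amin(times_no_self))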
print(result2)

(array([29, 54], dtype=int64), array([54, 29], dtype=int64))
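
As an alternative to overwriting the diagonal with a sentinel value, the search can be restricted to pairs of distinct cities. The sketch below is a minimal pure-Python illustration on a made-up 4-city matrix; `extreme_pairs` and `toy_times` are hypothetical names, not part of the notebook.

```python
def extreme_pairs(times):
    """Return ((min_time, i, j), (max_time, i, j)) over all pairs with i < j."""
    n = len(times)
    # Only look above the diagonal: the matrix is symmetric and the diagonal is zero
    pairs = [(times[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs), max(pairs)


# Made-up symmetric travel times between four cities
toy_times = [
    [0.0, 2.5, 4.0, 1.0],
    [2.5, 0.0, 0.5, 3.0],
    [4.0, 0.5, 0.0, 6.0],
    [1.0, 3.0, 6.0, 0.0],
]

closest, farthest = extreme_pairs(toy_times)
```

Because the matrix itself is never modified, later steps do not need to undo a sentinel.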


The lowest travel time occurs at index (29, 54), or equivalently (54, 29). Let's look up which cities these indices correspond to, along with the travel time.

In [10]:
lowest_time = round(df_travel_times.iloc[29][54], 5)
lowest_city1 = df_travel_times.iloc[29].name
lowest_city2 = df_travel_times.iloc[54].name

print('Lowest travel time occurs between', lowest_city1, 'and', lowest_city2, 'with a travel time of', lowest_time, 'hours.')

Lowest travel time occurs between Trenton, New Jersey and Philadelphia, Pennsylvania with a travel time of 0.04954 hours.


#### c) City with the smallest average travel time to all other cities

To find the city with the smallest average travel time, let's take the average of each column in our DataFrame, ignoring each city's travel time to itself.

In [11]:
df_travel_times.replace(100, np.nan, inplace=True)
avg_times = df_travel_times.mean(axis = 0, skipna = True)
smallest_avg = np.min(avg_times)
condition = (avg_times == smallest_avg)
result3 = np.where(condition)
result3

(array([12], dtype=int64),)

The city with the smallest average travel time has index 12.

In [12]:
city_smallest_avg = df_travel_times.iloc[12].name

print('City with the smallest average travel time to all other cities:', city_smallest_avg)
print('Average travel time of ' + city_smallest_avg + ':', round(smallest_avg, 4))

City with the smallest average travel time to all other cities: Springfield, Illinois
Average travel time of Springfield, Illinois: 1.599


### Part 2: Finding a schedule

#### a)

Now, let's randomly generate 100 sequences of the 68 cities.

In [13]:
# Set parameters and array for our 100 random sequences of 68 cities
nCities = 68
random_cities = np.zeros((100, 69))

# Set random seed to 50
np.random.seed(50)

# Generate 100 random sequences of the 68 cities
for i in range(100):
    temp = np.random.permutation(nCities)
    temp = np.append(temp, temp[0])
    random_cities[i] = temp

Next, let's calculate the total travel time of each sequence, including the return to its starting city, and then average these totals over the 100 sequences.

In [14]:
random_city_times = []

for i in range(100):
    total_time = 0
    for j in range(68):
        origin = int(random_cities[i][j])
        destination = int(random_cities[i][j+1])
        total_time += df_travel_times.iloc[origin][destination]
    random_city_times.append(total_time)

avg_city_times = round(sum(random_city_times) / len(random_city_times), 5)
print("The average of the total travel times of the 100 randomly generated sequences is " + str(avg_city_times) + " hours.")

The average of the total travel times of the 100 randomly generated sequences is 149.46322 hours.
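
The random baseline above can be sketched compactly in pure Python. This is a toy illustration under stated assumptions: `tour_time`, `random_tour_average` and the 4-city `toy_times` matrix are hypothetical, and the standard library's `random.Random` is used instead of NumPy, so the sampled tours will differ from the notebook's.

```python
import random


def tour_time(order, times):
    """Total time of the closed tour that visits `order` and returns to its start."""
    # Pair each city with its successor; the last city pairs with the first
    return sum(times[a][b] for a, b in zip(order, order[1:] + order[:1]))


def random_tour_average(times, n_tours, seed):
    """Average closed-tour time over `n_tours` random visiting orders."""
    rng = random.Random(seed)
    cities = list(range(len(times)))
    totals = []
    for _ in range(n_tours):
        rng.shuffle(cities)
        totals.append(tour_time(cities, times))
    return sum(totals) / len(totals)


# Made-up symmetric travel times between four cities
toy_times = [
    [0.0, 2.5, 4.0, 1.0],
    [2.5, 0.0, 0.5, 3.0],
    [4.0, 0.5, 0.0, 6.0],
    [1.0, 3.0, 6.0, 0.0],
]

average = random_tour_average(toy_times, n_tours=100, seed=50)
```

A random average like this is only a reference point: any sensible heuristic should produce a tour well below it.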


#### b)

Now, let's build a sequence greedily: start in Los Angeles and always move to the city that is closest to the current city in travel time and has not been visited yet.

In [15]:
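# Restore the zeros on the diagonal (they were set to NaN earlier)
df_travel_times.replace(np.nan, 0, inplace=True)

# We will start with LA (row 51 of the travel-time matrix) as our first city
LA_startpoint_time = 0
already_visited = [51]
travel_times_b = df_travel_times.iloc[51].tolist()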

# Find the next 67 closest cities in a row
for i in range(67):
    # sorted_times[0] is the zero travel time from the current city to itself, so start at 1
    j = 1
    sorted_times = sorted(travel_times_b)
    min_value = sorted_times[j]
    min_index = travel_times_b.index(min_value)
    # Skip cities that have already been visited
    while min_index in already_visited:
        j += 1
        min_value = sorted_times[j]
        min_index = travel_times_b.index(min_value)
    # Move to the closest unvisited city
    LA_startpoint_time += min_value
    already_visited.append(min_index)
    travel_times_b = df_travel_times.iloc[min_index].tolist()

# Add in travel time returning to LA
return_to_LA = df_travel_times.iloc[already_visited[67]][51]
LA_startpoint_time += return_to_LA
print("The total travel time when starting with LA and moving to the next closest city is " + str(round(LA_startpoint_time, 2)) + " hours.")

The total travel time when starting with LA and moving to the next closest city is 35.85 hours.
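
The greedy rule used above is commonly called the nearest-neighbor heuristic. A minimal standalone sketch on a made-up 4-city matrix follows; `nearest_neighbor_tour` and `toy_times` are hypothetical names, and ties are broken by the lowest city index, which may differ from the notebook's tie-breaking.

```python
def nearest_neighbor_tour(times, start):
    """Greedy tour: always move to the closest unvisited city, then return to start."""
    n = len(times)
    order = [start]
    visited = {start}
    total = 0.0
    current = start
    while len(order) < n:
        # Closest city that has not been visited yet
        nxt = min(
            (c for c in range(n) if c not in visited),
            key=lambda c: times[current][c],
        )
        total += times[current][nxt]
        order.append(nxt)
        visited.add(nxt)
        current = nxt
    # Close the tour by returning to the starting city
    total += times[current][start]
    return order, total


# Made-up symmetric travel times between four cities
toy_times = [
    [0.0, 2.5, 4.0, 1.0],
    [2.5, 0.0, 0.5, 3.0],
    [4.0, 0.5, 0.0, 6.0],
    [1.0, 3.0, 6.0, 0.0],
]

order, total = nearest_neighbor_tour(toy_times, start=0)
```

Tracking visited cities in a set avoids re-sorting the full row at every step. The heuristic is fast but carries no optimality guarantee.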


#### c)

Now we will solve an optimization problem to find the order in which cities should be visited so as to minimize total travel time. 
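
For reference, the model built in the following cells can be written as the formulation below. The notation is introduced here for illustration and does not appear in the notebook: $V$ is the set of cities, $t_{ij}$ is the travel time from city $i$ to city $j$, and $x_{ij} = 1$ if the tour goes directly from $i$ to $j$. In the code, the subtour constraints are not all added up front; they are added lazily, only for subtours found in candidate solutions.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{i \in V} \sum_{j \in V,\, j \neq i} t_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{j \in V,\, j \neq i} x_{ij} = 1 && \forall i \in V
    && \text{(leave each city once)} \\
& \sum_{j \in V,\, j \neq i} x_{ji} = 1 && \forall i \in V
    && \text{(enter each city once)} \\
& \sum_{i \in S} \sum_{j \in S,\, j \neq i} x_{ij} \le |S| - 1
    && \forall S \subsetneq V,\ |S| \ge 2
    && \text{(no subtours)} \\
& x_{ij} \in \{0, 1\} && \forall i \neq j
\end{aligned}
```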

First, let's build our dictionary of travel times and keys for the travel times.

In [16]:
travel_time_dict = {}
for i in range(len(df_travel_times)):
    for j in range(len(df_travel_times)):
        if i != j:
            travel_time_dict[(i, j)] = df_travel_times.iloc[i, j]
            
od_pairs = travel_time_dict.keys()

Next, let's define some important functions we will use in our optimization model.

In [18]:
# Function that will output all subtours
def getSubtours(sequence):
    subtour_list = []
    unvisited = list(range(nCities))
    
    while ( len(unvisited) > 0 ):
        node = unvisited.pop()
        
        subtour = []
        subtour.append(node)
        
        next_node = list(filter(lambda t: t[0] == node, sequence))[0][1]
        
        while (next_node in unvisited):
            subtour.append(next_node)
            unvisited.remove(next_node)
            next_node = list(filter(lambda t: t[0] == next_node, sequence))[0][1]
            
        subtour_list.append(subtour)
    
    return subtour_list

# Function that will eliminate existing subtours
def eliminateSubtours(model, where):
    if (where == GRB.Callback.MIPSOL):
        x_val = model.cbGetSolution(x)
        sequence = [ (i,j) for (i,j) in od_pairs if x_val[i,j] > 0.5]
        subtour_list = getSubtours(sequence)
        if (len(subtour_list) > 1):
            for subtour in subtour_list:
                model.cbLazy( sum(x[i,j] for i in subtour for j in subtour if i != j) <= len(subtour) - 1)
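
To illustrate what the subtour search computes, here is a minimal standalone sketch, independent of Gurobi, that splits a set of selected arcs into the cycles they form. The name `find_cycles` and the example arcs are hypothetical. It assumes exactly one arc leaves each node, which the degree constraints of the model guarantee for an integer solution.

```python
def find_cycles(arcs):
    """Split a set of arcs, one leaving each node, into the cycles they form."""
    successor = dict(arcs)  # node -> the node visited next
    unvisited = set(successor)
    cycles = []
    while unvisited:
        node = min(unvisited)  # deterministic choice of where to start
        cycle = []
        # Follow successors until we come back to an already visited node
        while node in unvisited:
            unvisited.remove(node)
            cycle.append(node)
            node = successor[node]
        cycles.append(cycle)
    return cycles


# Five nodes covered by two disjoint cycles: 0 -> 1 -> 2 -> 0 and 3 -> 4 -> 3
cycles = find_cycles([(0, 1), (1, 2), (2, 0), (3, 4), (4, 3)])

# A single cycle through all nodes is a valid tour
single_tour = find_cycles([(0, 1), (1, 2), (2, 0)])
```

A solution is a valid tour exactly when one cycle is found. Storing successors in a dictionary replaces the repeated scans of the arc list with constant-time lookups.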

Now let's implement the model below and find the minimum total travel time.

In [21]:
m = Model()

x = m.addVars(od_pairs, vtype = GRB.BINARY)

for i in range(nCities):
    m.addConstr( sum(x[i,j] for j in range(nCities) if j != i ) == 1)
    m.addConstr( sum(x[j,i] for j in range(nCities) if j != i ) == 1)

m.setObjective(sum( travel_time_dict[i,j] * x[i,j] for (i,j) in od_pairs ), GRB.MINIMIZE)

m.update()

# Enable lazy constraints
m.params.LazyConstraints = 1

# Supply the callback to Gurobi:
m.optimize(eliminateSubtours)

optimal_travel = m.objval
print("Optimal travel time:", round(optimal_travel, 2), 'hours')

Set parameter LazyConstraints to value 1
Gurobi Optimizer version 9.5.0 build v9.5.0rc5 (win64)
Thread count: 6 physical cores, 12 logical processors, using up to 12 threads
Optimize a model with 136 rows, 4556 columns and 9112 nonzeros
Model fingerprint: 0x5c4ba032
Variable types: 0 continuous, 4556 integer (4556 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+00]
  Objective range  [5e-02, 9e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 1e+00]
Presolve time: 0.01s
Presolved: 136 rows, 4556 columns, 9112 nonzeros
Variable types: 0 continuous, 4556 integer (4556 binary)

Root relaxation: objective 2.729102e+01, 162 iterations, 0.00 seconds (0.00 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0   29.74856    0   94          -   29.74856      -     -    0s
     0     0   29.90657    0   26          -   29.90657      -     -    0

## Question 2: Winning the NBA championship

### Part 3: A dream team

First, let's load in our player data to see what we're working with.

In [69]:
df_nba = pd.read_csv('nba-players-2018-2019-with-pos.csv')
df_nba.head()

Unnamed: 0,Player,Pos,Tm,X3PA,X2PA,FGA,FTA,ORB,DRB,AST,STL,BLK,TOV,PF,Salary
0,Alex Abrines,SG,OKC,127,30,157,13,5,43,20,17,6,14,53,5455236.0
1,Quincy Acy,PF,PHO,15,3,18,10,3,22,8,1,4,4,24,213949.0
2,Jaylen Adams,PG,ATL,74,36,110,9,11,49,65,14,5,28,45,236854.0
3,Steven Adams,C,OKC,2,807,809,292,391,369,124,117,76,135,204,24157304.0
4,Bam Adebayo,C,MIA,15,471,486,226,165,432,184,71,65,121,203,2955840.0


Now, let's build a linear optimization model that will maximize the predicted points the team would score. First, let's create some parameters and data that will help with setting up the model.

In [70]:
# Parameters
n_stats = 10 # number of relevant stats
n_players = 522 # total players available
n_positions = 5 # number of basketball positions

# Some relevant data
coeff = [0.69129, 0.36184, 0.78122, -0.20277, 0.61064, 0.91510, 0.50506, 0.07975, -0.62701, 0.30420] # Linear regression coefficients from Part 2
salary = df_nba["Salary"] # all NBA player salaries
stats = df_nba.drop(['Player', 'Pos', 'Tm', 'FGA','Salary'], axis = 1).to_numpy() # Statistics data for each player

# Setting up position data numerically
positions = df_nba['Pos'].to_frame()
positions = positions.replace({'Pos':{'PG':1, 'SG':10, 'SF':100, 'PF':1000, 'C':10000}})
positions = positions.squeeze()
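
The numeric position codes assign one decimal digit to each position. The model below requires the codes of the 15 selected players to sum to 33333, which is intended to force three players at each position. The sketch below is a brute-force sanity check of that claim, written for illustration only; `feasible_position_counts` is a hypothetical helper, not part of the notebook.

```python
from itertools import product

POSITION_CODES = {'PG': 1, 'SG': 10, 'SF': 100, 'PF': 1000, 'C': 10000}
ROSTER_SIZE = 15
TARGET = 33333


def feasible_position_counts():
    """List every split of the roster across positions whose codes sum to TARGET."""
    codes = list(POSITION_CODES.values())
    feasible = []
    # Try every number of players (0 to 15) at each of the five positions
    for counts in product(range(ROSTER_SIZE + 1), repeat=len(codes)):
        if sum(counts) != ROSTER_SIZE:
            continue
        if sum(k * c for k, c in zip(counts, codes)) == TARGET:
            feasible.append(dict(zip(POSITION_CODES, counts)))
    return feasible


solutions = feasible_position_counts()
```

The check matters because a digit can carry once more than nine players share a position. With a roster of 15, the enumeration finds that three players per position is the only split that reaches 33333.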

After setting up the data and parameters, let's use Gurobi to build our model.

In [71]:
# Create the model. 
m_nba = Model()

# Define our decision variables 
x = m_nba.addVars(n_players, vtype = GRB.BINARY) # binary variable for the players selected

# Roster size: exactly 15 players
player_constr = m_nba.addConstr(sum(x[i] for i in range(n_players)) == 15)
# Salary cap: total salary of at most 100,000,000
salary_constr = m_nba.addConstr(sum(x[i] * salary[i] for i in range(n_players)) <= 100000000)
# Positions: the digit-encoded position codes must sum to 33333, i.e. three players per position
position_constr = m_nba.addConstr(sum(x[i] * positions[i] for i in range(n_players)) == 33333)

# Create the objective function.
m_nba.setObjective(513.833 + sum(coeff[j] * x[i] * stats[i, j] for i in range(n_players) for j in range(n_stats)), GRB.MAXIMIZE)

m_nba.update()

m_nba.optimize()

Gurobi Optimizer version 9.5.0 build v9.5.0rc5 (win64)
Thread count: 6 physical cores, 12 logical processors, using up to 12 threads
Optimize a model with 3 rows, 522 columns and 1566 nonzeros
Model fingerprint: 0x2daec752
Variable types: 0 continuous, 522 integer (522 binary)
Coefficient statistics:
  Matrix range     [1e+00, 4e+07]
  Objective range  [2e-01, 2e+03]
  Bounds range     [1e+00, 1e+00]
  RHS range        [2e+01, 1e+08]
Presolve time: 0.00s
Presolved: 3 rows, 522 columns, 1458 nonzeros
Variable types: 0 continuous, 522 integer (522 binary)

Root relaxation: objective 2.200651e+04, 6 iterations, 0.00 seconds (0.00 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0 22006.5109    0    3          - 22006.5109      -     -    0s
H    0     0                    18285.856890 22006.5109  20.3%     -    0s
H    0     0                    20386.097860 22006.
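
To make the objective concrete: it is linear in the selection variables, so each player contributes a fixed number of predicted points, and the intercept 513.833 is added once for the whole team. The sketch below recomputes that contribution outside Gurobi. `STAT_NAMES`, `player_contribution` and `predicted_points` are hypothetical names, and the pairing of coefficients with statistics simply follows the column order left after the `drop` call above.

```python
INTERCEPT = 513.833
COEFF = [0.69129, 0.36184, 0.78122, -0.20277, 0.61064,
         0.91510, 0.50506, 0.07975, -0.62701, 0.30420]
# Column order remaining after dropping Player, Pos, Tm, FGA and Salary
STAT_NAMES = ['X3PA', 'X2PA', 'FTA', 'ORB', 'DRB', 'AST', 'STL', 'BLK', 'TOV', 'PF']


def player_contribution(stats_row):
    """Predicted points added by one player (that player's objective coefficient)."""
    return sum(c * s for c, s in zip(COEFF, stats_row))


def predicted_points(roster_rows):
    """Predicted team points: intercept plus the contribution of every selected player."""
    return INTERCEPT + sum(player_contribution(row) for row in roster_rows)


# Steven Adams' statistics from the table printed earlier, in STAT_NAMES order
steven_adams = [2, 807, 292, 391, 369, 124, 117, 76, 135, 204]
adams_points = player_contribution(steven_adams)
```

Precomputing one coefficient per player would also let the objective be written as a single sum over players.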

Below are the dream team players along with the predicted number of points that the team will score in the upcoming season.

In [72]:
# Compare against 0.5 rather than 0.0 so small solver tolerances on binary variables are ignored
optimal_players = [i for i in range(n_players) if x[i].x > 0.5]
print("Optimal players:")
for i in optimal_players:
    print('->',df_nba.iloc[:,0][i], '-',df_nba.iloc[:,1][i])

optimal_points = m_nba.objval
print('')
print("Optimal points: ",round(optimal_points, 2))

Optimal players:
-> Giannis Antetokounmpo - PF
-> Devin Booker - SG
-> Luka Doncic - SG
-> De'Aaron Fox - PG
-> Kyle Kuzma - PF
-> Donovan Mitchell - SG
-> Cedi Osman - SF
-> Domantas Sabonis - C
-> Pascal Siakam - PF
-> Jayson Tatum - SF
-> Karl-Anthony Towns - C
-> Nikola Vucevic - C
-> Kemba Walker - PG
-> Justise Winslow - SF
-> Trae Young - PG

Optimal points:  20577.64
