# Regression - Predicting Lineup Performance

This notebook attempts to predict the performance of a hypothetical lineup in terms of any advanced statistic. To do so, a regression model is trained from scratch using plain (ordinary least-squares) linear regression, which is a supervised method. There was no need to label data manually: all on-court combinations were extracted from play-by-play data together with their corresponding statistics (both on offense and defense). By describing each player with a set of features and building a single feature vector for each combination, a linear model can be trained using the associated stats as ground truth. Please note that this example is only a simple baseline, and many other relevant factors haven't been taken into account yet. If you are interested in a deep learning model (Keras/PyTorch) built from fully connected layers, let me know!

Sidenote: I don't really know why, but I've always been more comfortable when using .npy files instead of .csv's; if you prefer the latter, feel free to adapt the code with these files: https://drive.google.com/drive/folders/1GIeoWWMbZO02idPNlVVQLJWLWbtespiX?usp=sharing

In [5]:
import numpy as np
import random
from sklearn import linear_model
import pandas as pd
import itertools
import pickle

### Data Preparation
As always, we start by loading our data and inspecting what we might find in it. 
In this scenario, for a fixed number of players (2, 3, 4 or 5), we are going to load two datasets: 
1. All possible combinations for the desired number of players, including player and team names, as well as basic offensive and defensive stats. 
2. Feature vectors that describe each player with 10 characteristics. These are the vectors detailed in the previous notebook on classification; as mentioned there, the features are (1) 3-point Proficiency, (2) Offensive Load, (3) True Shooting Percentage, (4) FTA/FGA, (5) Offensive and (6) Defensive Rebounding Percentage, (7) Assist Ratio, (8) Turnover Ratio, (9) Steal Ratio, and (10) 2FG/2FGA of the opponent team. 
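
The player-season keys described above can be sketched as follows. This is a minimal illustration with invented numbers: `toy_names`, `toy_vectors`, `lookup` and `get_feature_vector` are hypothetical names, not part of the dataset. Building a dictionary once avoids scanning the list of names on every lookup.

```python
# Toy stand-ins for matNames / featVecsInd (values are invented for illustration)
toy_names = ["PLAYER, A 2016 AAA", "PLAYER, A 2017 AAA", "PLAYER, B 2017 AAA"]
toy_vectors = [
    [0.38, 0.44, 0.54, 0.16, 0.02, 0.26, 0.26, 0.05, 0.01, 0.47],
    [0.40, 0.41, 0.55, 0.18, 0.02, 0.25, 0.27, 0.05, 0.01, 0.46],
    [0.31, 0.52, 0.58, 0.30, 0.04, 0.20, 0.30, 0.06, 0.02, 0.45],
]

# Map each "NAME YEAR TEAM" key to its row index (built once)
lookup = {name: idx for idx, name in enumerate(toy_names)}

def get_feature_vector(player, year, team):
    """Return the 10-feature vector for a player-season, or None if absent."""
    idx = lookup.get(f"{player} {year} {team}")
    return None if idx is None else toy_vectors[idx]

vec = get_feature_vector("PLAYER, B", 2017, "AAA")
missing = get_feature_vector("PLAYER, B", 2016, "AAA")
print(vec[1])    # offensive load is the second feature (index 1)
print(missing)   # None: there is no entry for that season
```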

In [6]:
# Declare number of players and features to be used 
nPlayers = 4 # Could be changed to 2 - 3 - 4 - 5
nFeats = 10
# Decide minimum number of possessions on court of each combination of players
if nPlayers == 2:
    minPoss = 500
elif nPlayers == 3:
    minPoss = 250
elif nPlayers == 4:
    minPoss = 200
elif nPlayers == 5:
    minPoss = 100

# Load matrices
sFolder = '/Users/arbues/Documents/UCAM/Euroleague-Notebooks/Data/'
# Lineup combinations of nPlayers
if nPlayers!=4:
    allLineups = np.load(sFolder + "Lineup-Combos/PermPlayers" + str(nPlayers) + ".npy", allow_pickle=True)
else:
    allLineups1 = np.load(sFolder + "Lineup-Combos/PermPlayers" + str(nPlayers) + "a.npy", allow_pickle=True)
    allLineups2 = np.load(sFolder + "Lineup-Combos/PermPlayers" + str(nPlayers) + "b.npy", allow_pickle=True)
    allLineups = np.concatenate([allLineups1,allLineups2])
# Corresponding teams
allTeams = np.load(sFolder + "Lineup-Combos/PermTeams" + str(nPlayers) + ".npy", allow_pickle=True)
# Corresponding offensive stats
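# Input vec: [0] 2FGM, [1] 2FGA, [2] 3FGM, [3] 3FGA, [4] FTM, [5] FTA, [6] DReb, [7] OReb, [8] Ast, [9] 'null', [10] TOV
# Note: basic2advancedStats (below) adds columns 0+1, 2+3 and 4+5 to get attempts, so the 'A' columns appear to hold missed shots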
basicStatsOf = np.load(sFolder + "Lineup-Combos/basicStatsOf" + str(nPlayers) + ".npy", allow_pickle=True)
# Corresponding defensive stats
basicStatsDef = np.load(sFolder + "Lineup-Combos/basicStatsDef" + str(nPlayers) + ".npy", allow_pickle=True)
# Feature vectors of all players (except 2019-2020)
featVecsInd = np.load(sFolder + "Lineup-Combos/FeatVecsIndN.npy", allow_pickle=True)
# Names of each feature vector
matNames = list(np.load(sFolder + "Lineup-Combos/FeatVecsIndNames.npy", allow_pickle=True))

Let's inspect once again what type of data are we dealing with here (collective lineups + individual feature vectors): 

In [7]:
print(matNames[0])

iPosAux = 0
print('Players: ' + str(allLineups[iPosAux]), ' | Team: ' + str(allTeams[iPosAux]))
print('Offensive Stats: ' + str(basicStatsOf))
print('Defensive Stats: ' + str(basicStatsDef))
for iP in range(0, nPlayers):
    for iYear in range(2007, 2019):
        namePlayer = allLineups[iPosAux][iP] + ' ' + str(iYear) + ' ' + str(allTeams[iPosAux])
        try:
            print('Feature Vector of: ' + namePlayer + ': ' + str(featVecsInd[matNames.index(namePlayer)]))
        except ValueError:
            # No feature vector stored for this player-season-team key
            pass

JACKSON, MARC 2007 OLY
Players: ['GRANGER, JAYSON' 'HUERTAS, MARCELINHO' 'MCRAE, JORDAN'
 'SHENGELIA, TORNIKE']  | Team: BAS
Offensive Stats: [[0. 1. 0. ... 1. 1. 0.]
 [0. 0. 0. ... 2. 1. 0.]
 [6. 6. 4. ... 5. 4. 0.]
 ...
 [1. 0. 0. ... 1. 2. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Defensive Stats: [[ 1.  1.  0. ...  1.  1.  0.]
 [ 1.  1.  1. ...  1.  2.  0.]
 [ 6. 10.  3. ...  3.  5.  0.]
 ...
 [ 0.  1.  0. ...  2.  2.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]]
Feature Vector of: GRANGER, JAYSON 2017 BAS: [0.38732658 0.44316652 0.5433914  0.15811966 0.01574803 0.26395939
 0.25972703 0.04558301 0.01028807 0.47154472]
Feature Vector of: HUERTAS, MARCELINHO 2010 BAS: [0.39960751 0.59725617 0.56441327 0.32846715 0.04234528 0.27027027
 0.35289747 0.06729081 0.01393612 0.45652174]
Feature Vector of: HUERTAS, MARCELINHO 2017 BAS: [0.4363905  0.39564912 0.53566762 0.23021583 0.0228013  0.14556962
 0.26046884 0.04635219 0.00382117 0.40625   ]
Feature Vec

However, as you all know, basic stats fall short in several scenarios, so the numbers displayed above should be converted into advanced metrics. The following function computes some widely used ones: offensive and defensive efficiency ratings (OER/DER), assist ratio, offensive and defensive rebounding ratios, turnover ratio and effective field goal percentage. 

In [8]:
# Switch from basic to advanced stats
def basic2advancedStats(matrix,lineups,matrixAux,minPoss,teams):
    # Possessions
    possessions = 0.96*(matrix[:,0]+matrix[:,1]+matrix[:,2]+matrix[:,3]+matrix[:,10]+0.44*(matrix[:,4]+matrix[:,5])-matrix[:,7])
    # Filter by minimum possessions
    possessionsVal = possessions>minPoss
    possessions = possessions[possessionsVal]
    lineups = lineups[possessionsVal]
    teams = teams[possessionsVal]
    matrix = matrix[possessionsVal]
    matrixAux = matrixAux[possessionsVal]
    # OER-DER
    OER = (100*(2*matrix[:,0]+3*matrix[:,2]+matrix[:,4]))/possessions
    DER = (100*(2*matrixAux[:,0]+3*matrixAux[:,2]+matrixAux[:,4]))/possessions
    # Assist Ratio (you can switch from percentage of assisted shots or ratio per possession)
    # ASTr = (matrix[:,8]*100)/(matrix[:,0]+matrix[:,1]+matrix[:,2]+matrix[:,3]+matrix[:,10]+0.44*(matrix[:,4]+matrix[:,5])+matrix[:,8])
    ASTr = (matrix[:,8]*100)/(matrix[:,0]+matrix[:,2]+0.44*(matrix[:,4]))
    # Offensive Rebounding Ratio
    OReb = (matrix[:,7]*100)/(matrix[:,7]+matrixAux[:,6])
    # Defensive Rebounding Ratio
    DReb = (matrix[:,6]*100)/(matrix[:,6]+matrixAux[:,7])
    # Turnover Ratio
    TOV = (matrix[:,10]*100)/(matrix[:,0]+matrix[:,1]+matrix[:,2]+matrix[:,3]+matrix[:,10]+0.44*(matrix[:,4]+matrix[:,5])+matrix[:,8])
    # Effective Field Goal Percentage
    eFG = (matrix[:,0]+matrix[:,2]+0.5*matrix[:,2])/(matrix[:,0]+matrix[:,1]+matrix[:,2]+matrix[:,3])
    return possessions, OER, DER, ASTr, OReb, DReb, TOV, eFG*100, lineups, possessionsVal, teams

# Obtain advanced stats 
possessions, OER, DER, ASTr, OReb, DReb, TOV, eFG, lineups, possessionsVal, teams = basic2advancedStats(basicStatsOf, allLineups, basicStatsDef, minPoss,allTeams)
NET = OER-DER
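
To make the formulas above concrete, here is a small worked example for a single invented stat line. It is only a sketch: it assumes the made/missed column layout that the sums inside `basic2advancedStats` imply, and every number and variable name below is hypothetical.

```python
# Invented stat line for one lineup, following the column order used by
# basic2advancedStats (assumption: columns 1, 3 and 5 hold missed shots)
two_made, two_missed = 40, 35
three_made, three_missed = 15, 25
ft_made, ft_missed = 20, 5
off_reb, turnovers = 12, 18

# Possession estimate: shots + turnovers + 0.44 * free throws - offensive rebounds
field_goal_attempts = two_made + two_missed + three_made + three_missed
free_throw_attempts = ft_made + ft_missed
possessions = 0.96 * (field_goal_attempts + turnovers
                      + 0.44 * free_throw_attempts - off_reb)

# Points scored per 100 possessions
points = 2 * two_made + 3 * three_made + ft_made
oer = 100 * points / possessions

# Effective field goal percentage: a made three counts as 1.5 made shots
efg = 100 * (two_made + 1.5 * three_made) / field_goal_attempts

print(f"Possessions: {possessions:.2f}")
print(f"OER: {oer:.2f}")
print(f"eFG%: {efg:.2f}")
```

With these numbers the lineup uses roughly 127 possessions and scores about 114 points per 100 of them, which is the scale on which the regression errors later in the notebook should be read.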

At this point we have lineup combinations and individual feature vectors. The next step is to find the feature vector of each player in every combination and concatenate them into a single feature vector per combo. There is one main limitation, though: individual feature vectors are computed per season, while combinations aggregate all seasons into one. When the same players shared a team for more than one season, which does not happen often (apart from Spanoulis-Printezis combos et al.), we'll pick one of those seasons at random. Besides storing the players' features, we also store the lineup's value for the advanced statistic to be predicted (OER in the default example). 
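
The season-matching step described above boils down to a set intersection followed by a random pick. The sketch below uses invented players and seasons, and a seeded generator so the choice is repeatable; `seasons_by_player` and the other names are hypothetical.

```python
import random

# Hypothetical seasons in which each player of a lineup appears for the same team
seasons_by_player = {
    "PLAYER, A": {2015, 2016, 2017},
    "PLAYER, B": {2016, 2017, 2018},
    "PLAYER, C": {2014, 2016, 2017},
}

# Seasons shared by every player in the lineup
common_seasons = sorted(set.intersection(*seasons_by_player.values()))

# Pick one shared season at random; a fixed seed keeps the sketch reproducible
rng = random.Random(0)
chosen_season = rng.choice(common_seasons) if common_seasons else None

print(common_seasons)   # [2016, 2017]
print(chosen_season)
```

A lineup with no shared season yields `None` here, which corresponds to the lineups that the notebook skips.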

Finally, once we have one feature vector per player, we sort the players by their "offensive load" value before concatenating. This way, the "most important offensive player" always occupies the first 10 positions of the array, the second one positions 10 to 20, and so on. Concatenating in a consistent, non-random order gives every position in the vector the same meaning across lineups, which should help the model make better predictions.

In [9]:
# Define year range
yearBeg = 2007
yearFin = 2019

# Decide which sample is gonna be regressed (could be any advanced stat from the previous ones)
statSample = np.copy(OER)
statString = 'OER'

# Initialize and build feature vectors
featVecs = []
finStat = []
finLineups = []
for iL in range(0, len(lineups)):
    # For each combo, we are gonna check which years did those players play together in one same team
    allP = []
    for iP in range(0, len(lineups[iL])):
        allPi = []
        for iYear in range(yearBeg, yearFin):
            # Search with Player Name + Season + Current Team
            strName = lineups[iL][iP] + ' ' + str(iYear) + ' ' + teams[iL]
            if strName in matNames:
                # If the player exists, append index, otherwise -1
                allPi.append(matNames.index(lineups[iL][iP] + ' ' + str(iYear) + ' ' + teams[iL]))
            else:
                allPi.append(-1)
        allP.append(allPi)
    allP = np.array(allP)
    
    # Check which possible years can be chosen (generally < 2)
    yearPoss = []
    try:
        for iYear in range(0, yearFin-yearBeg):
            # For each year, check if all players in a lineup played
            if np.sum(allP[:,iYear] > -1) == nPlayers:
                yearPoss.append(iYear)
        # Chose one of the random options and store the corresponding year
        finYear = random.choice(yearPoss)
        
        # Declare feature vector
        concVecs = []
        for iP in range(0, len(lineups[iL])):
            # Find the feature vector that belongs to each player plus season
            strName = lineups[iL][iP] + ' ' + str(finYear+yearBeg) + ' ' + teams[iL]
            concVecs.append(featVecsInd[matNames.index(strName)])
        # Sort vectors by offensive load
        posSort = np.argsort(np.array(concVecs)[:,1])
        posSort = posSort[::-1]
        # Final lineup feature vector containing all players
        featVecsi = np.zeros(nPlayers*nFeats,)
        for iP in range(0, len(posSort)):
            featVecsi[iP*nFeats:(iP+1)*nFeats] = concVecs[posSort[iP]]
        # Append all feature vectors - final stat to be regressed - lineups
        featVecs.append(np.array(featVecsi).flatten())
        finStat.append(statSample[iL])
        finLineups.append(lineups[iL])
    except (IndexError, ValueError):
        # Skip lineups with no season in which every player has a feature vector
        pass

### Predictive Model - Training
It's been a long ride, but now that we have a single feature vector and an associated value (stat) for each lineup, we can train our model. Before getting started, though, we shuffle the dataset and split it into training and test sets once again. Note that we also store each lineup's names in a single string so they print more nicely. 

In [10]:
# Create the Shuffler
posAux = np.arange(len(featVecs))
np.random.shuffle(posAux)

# Shuffle dataset
featVecs = np.array(featVecs)[posAux]
finStat = np.array(finStat)[posAux]
finLineups = np.array(finLineups)[posAux]

# Split into train and test
partTrain = 0.8
posTrain = int(len(featVecs)*partTrain)
# Training Set
featVecs_train = featVecs[:posTrain]
namesTrain = finLineups[:posTrain]
statSample_train = finStat[:posTrain]
# Test Set (starts at posTrain so that no sample is dropped between the two sets)
featVecs_test = featVecs[posTrain:]
namesTestAux = finLineups[posTrain:]
statSample_test = finStat[posTrain:]
# Convert lineup combinations in single concatenated strings
namesTest = []
for iName in range(0, len(namesTestAux)):
    nameAux = ''
    for iP in range(0, nPlayers):
        nameAux = nameAux + str(namesTestAux[iName][iP].split(',')[0]) + ' '
    nameAux = nameAux[:-1]
    namesTest.append(nameAux)
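
Because the shuffle above is unseeded, the split (and therefore the error metrics) changes on every run. If repeatable results are preferred, one option is a seeded generator, sketched here on synthetic data; the array names and sizes are made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for featVecs / finStat (random values, illustration only)
rng = np.random.default_rng(42)
X = rng.random((50, 40))          # 50 lineups, 4 players x 10 features
y = rng.normal(110, 8, size=50)   # one target stat per lineup

# Shuffle indices with the seeded generator, then cut at 80 %
order = rng.permutation(len(X))
cut = int(len(X) * 0.8)
train_idx, test_idx = order[:cut], order[cut:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (40, 40) (10, 40)
```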

As in the classification notebook, scikit-learn offers several predictive linear models (https://scikit-learn.org/stable/modules/linear_model.html); in this example we train a Linear Regression model, but please note that this is not the only option. 

In [11]:
# Initialize Model 
regr = linear_model.LinearRegression()
## Other possibilities
#regr = linear_model.LassoLars()
#regr = linear_model.SGDRegressor()
#regr = svm.SVR()  # not a linear model; needs: from sklearn import svm
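
If you want to compare candidates before settling on one, cross-validation is a convenient way to do it. The following sketch runs on synthetic data; `Ridge` and `Lasso` are my own picks from scikit-learn's linear models (the notebook does not use them), and the `alpha` values are arbitrary.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy linear target over 40 features (illustration only)
rng = np.random.default_rng(0)
X = rng.random((200, 40))
true_coef = rng.normal(0, 5, size=40)
y = 100 + X @ true_coef + rng.normal(0, 3, size=200)

candidates = {
    "LinearRegression": linear_model.LinearRegression(),
    "Ridge(alpha=1.0)": linear_model.Ridge(alpha=1.0),
    "Lasso(alpha=0.1)": linear_model.Lasso(alpha=0.1),
}

# 5-fold cross-validated mean absolute error for each candidate
cv_mae = {}
for label, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[label] = -scores.mean()
    print(f"{label}: MAE = {cv_mae[label]:.2f}")
```

On real lineup data the ranking may well differ, so treat this only as a template for the comparison.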

### Predictive Model - Testing
Once the model is trained, we use the test samples to assess its performance. Normally, the model coefficients are analyzed together with the MSE, but since we are strictly talking about basketball, we can also use metrics such as Mean Absolute Error or Median Absolute Error, which express the error in "basketball language", that is, in the units of the statistic being predicted (points per 100 possessions, assists, rebounds...).

In [12]:
# Train Model 
regr.fit(featVecs_train, statSample_train)
## In case you want to save the trained model
#pickle.dump(regr, open(sFolder + 'Models/' + statString + str(nPlayers) + '.pkl', 'wb'))

# Test Model
Stat_pred = regr.predict(featVecs_test)

## In case you want to inspect the obtained results, building a Pandas Dataframe might be useful
#matPred = [Stat_pred,statSample_test]
#columnsT = ['Predicted ' + statString,'Real ' + statString]
#matPredDF = pd.DataFrame(np.transpose(np.array(matPred)), index=namesTest, columns=columnsT)

print('Stat ' + str(statString))
print('Mean Absolute Error: %.2f' % np.mean(np.abs(statSample_test-Stat_pred)))
print('Median Absolute Error: %.2f' % np.median(np.abs(statSample_test-Stat_pred)))

Stat OER
Mean Absolute Error: 5.49
Median Absolute Error: 4.24
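
An absolute error is easier to judge next to a naive reference. One common choice, sketched below with invented numbers, is to always predict the mean of the training targets and compare both errors; all arrays and the helper `abs_errors` are hypothetical.

```python
import numpy as np

# Invented true / predicted values for a handful of lineups
y_train = np.array([108.0, 115.0, 121.0, 104.0, 112.0])
y_test = np.array([110.0, 125.0, 98.0, 117.0])
y_pred = np.array([112.0, 119.0, 104.0, 115.0])

def abs_errors(y_true, y_hat):
    """Absolute errors, in the same units as the predicted statistic."""
    return np.abs(y_true - y_hat)

# Model errors
model_err = abs_errors(y_test, y_pred)

# Naive baseline: always predict the mean of the training targets
baseline_err = abs_errors(y_test, np.full_like(y_test, y_train.mean()))

print(f"Model    MAE: {model_err.mean():.2f} | MedAE: {np.median(model_err):.2f}")
print(f"Baseline MAE: {baseline_err.mean():.2f} | MedAE: {np.median(baseline_err):.2f}")
```

If the model's error is not clearly below the baseline's, the features are adding little beyond the league average.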


Having accuracy metrics is always nice, but... once we know that the model is somewhat reliable, wouldn't it be much cooler to have a plug-and-play tool that predicts the performance of any given lineup? Of course! 
In the following snippet, you can enter the players you want (name + year), and the model will do the rest for you. In some cases, certain players might not be found; this is probably due to a shooting threshold: to appear in the individual feature-vector dataset, a player must have attempted at least 100 shots. 

In [13]:
# For an empty group of players
featVecs_choice = []
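# Drop the trailing team code (last 4 characters: a space plus a 3-letter code) so players can be looked up by name + year
matNames_noT = [x[:-4] for x in matNames]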

# Fill feature vectors with the desired players
for i in range(0, nPlayers):
    if i == 0:
        name = 'RODRIGUEZ, SERGIO'
        year = '2015'
    elif i == 1:
        name = 'NAVARRO, JUAN CARLOS'
        year = '2010'
    elif i == 2:
        name = 'SINGLETON, CHRIS'
        year = '2016'
    elif i == 3: 
        name = 'KAUN, SASHA'
        year = '2013'
    elif i == 4:
        name = 'DATOME, LUIGI'
        year = '2016'
    featVecs_choice.append(featVecsInd[matNames_noT.index(name.upper() + ' ' + year)])
# Reshape
featVecs_choice = np.reshape(np.array(featVecs_choice),(1,nPlayers*nFeats))
# Predict lineup performance
pred_out = regr.predict(featVecs_choice)
print('Predicted ' + statString + ': ' + str(pred_out[0]))

Predicted OER: 127.33999457749003
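
One detail worth keeping in mind: the training vectors were sorted by offensive load, whereas the snippet above concatenates the chosen players in the order they are typed. A possible way to apply the same ordering before predicting is sketched below; `build_lineup_vector` is a hypothetical helper and the tiny 3-feature vectors are invented.

```python
import numpy as np

def build_lineup_vector(player_vectors, load_index=1):
    """Concatenate per-player vectors, highest offensive load first."""
    vectors = np.asarray(player_vectors, dtype=float)
    order = np.argsort(vectors[:, load_index])[::-1]   # descending load
    return vectors[order].reshape(1, -1)

# Three invented players with 3 features each; index 1 plays the role of offensive load
players = [
    [0.30, 0.40, 0.55],
    [0.35, 0.60, 0.50],
    [0.20, 0.25, 0.58],
]
lineup_vec = build_lineup_vector(players)
print(lineup_vec.shape)   # (1, 9)
print(lineup_vec[0, :3])  # features of the player with the highest load
```

In the notebook, the list of per-player vectors collected in the loop could be passed to such a helper instead of being reshaped directly, so that the input to `regr.predict` follows the layout used for training.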


### Lineup Similarity 

Finally, another cool application is the comparison with similar lineups. By computing the absolute difference between the chosen lineup and every lineup in the dataset, over all possible player orderings, the results can be sorted by similarity and displayed together with their observed stats; this way, the user can also see how similar lineups performed. A deep dive into this topic can be found in Todd Whitehead's talk in "Beyond the 4 Factors": https://www.youtube.com/watch?v=DKv-1n5OHEc&ab_channel=Adri%C3%A0Arbu%C3%A9s 

In [14]:
# Initialize vector of differences
diffTot = []
finLineups_flat = []
# Find all possible orderings of the players
pPerm = list(itertools.permutations(range(nPlayers)))

for iLin in range(0, len(finLineups)):
    difPermi = []
    # Find the absolute difference between lineups for all possible combinations
    for perm in range(0, len(pPerm)):
        featPerm = np.zeros((1,nPlayers*nFeats))
        for iP in range(0, len(pPerm[perm])):
            featPerm[0,iP*nFeats:(iP+1)*nFeats] = featVecs[iLin][pPerm[perm][iP]*nFeats:(pPerm[perm][iP]+1)*nFeats]
        difPermi.append(np.sum(np.abs(featPerm-featVecs_choice)))
    # Take the minimum difference
    diffTot.append(min(difPermi))

    # Get a single string/lineup
    nameAux = ''
    for iP in range(0, nPlayers):
        nameAux = nameAux + str(finLineups[iLin][iP].split(',')[0]) + ' '
    nameAux = nameAux[:-1]
    finLineups_flat.append(nameAux)

# Sort lineups, from minimum to maximum difference
diffTot_top, finStat_top, finLineups_top, diffTot = zip(*sorted(zip(np.array(diffTot), finStat, np.array(finLineups_flat), np.array(diffTot))))
sumDiffs = np.sum(featVecs_choice)

# Print the iSim most similar lineups
iSim = 10
print('Top ' + str(iSim) + ' Lineups')
for iS in range(0, iSim):
    print(str(iS+1) + '. ' +  finLineups_top[iS] + ' ' + statString + ': ' + str(finStat_top[iS]))


Top 10 Lineups
1. MELLI SLOUKAS VESELY WANAMAKER OER: 129.04655685197898
2. DE COLO HINES KURBANOV TEODOSIC OER: 148.40606029602347
3. LORBEK MICKEAL NAVARRO VAZQUEZ OER: 119.37836429935753
4. DAVIES MICIC PANGOS ULANOVAS OER: 120.59345014601584
5. DUNSTON LARKIN MICIC MOERMAN OER: 130.42391004415012
6. HIGGINS HINES KURBANOV TEODOSIC OER: 126.30319908597545
7. DE COLO KAUN TEODOSIC VORONTSEVICH OER: 123.74865361464869
8. DE COLO HINES JACKSON KURBANOV OER: 134.41713205844192
9. CLYBURN DE COLO HUNTER RODRIGUEZ OER: 131.0383606853836
10. CLYBURN DE COLO HINES RODRIGUEZ OER: 135.2153685991281
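
The order-independent distance used above can be isolated in a small function. This is a compact sketch of the same idea (smallest summed absolute difference over all player orderings) on invented 2-player lineups; `lineup_distance` is a hypothetical name.

```python
import itertools
import numpy as np

def lineup_distance(lineup_a, lineup_b):
    """Smallest summed absolute difference over all player orderings of lineup_a."""
    a = np.asarray(lineup_a, dtype=float)   # shape: (players, features)
    b = np.asarray(lineup_b, dtype=float)
    return min(
        np.abs(a[list(perm)] - b).sum()
        for perm in itertools.permutations(range(len(a)))
    )

# Invented 2-player lineups with 3 features per player
target = [[0.4, 0.6, 0.5], [0.2, 0.3, 0.6]]
same_players_swapped = [[0.2, 0.3, 0.6], [0.4, 0.6, 0.5]]
different = [[0.1, 0.2, 0.4], [0.5, 0.7, 0.3]]

print(lineup_distance(same_players_swapped, target))  # 0.0: player order does not matter
print(lineup_distance(different, target))
```

Note that the number of orderings grows factorially with the lineup size (24 for four players, 120 for five), which is still cheap per comparison but adds up over a large set of lineups.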


Hope you liked it! 
If you have any questions / suggestions, feel free to send me an email (adria.arbues@upf.edu) or a Twitter DM (@arbues6). 