Gabriel Marcelino, Grant Burk, and Eli Kaustinen
September 2024
Artificial Neural Network (ANN)

## Import dependencies and load data

In [881]:
import csv
import random

import numpy as np
import tensorflow as tf

pool = []           # reserved for a separate player pool; not filled in this notebook
training_data = []  # every row of all_seasons.csv, each as a list of strings

# Load all player rows from the CSV file
with open('all_seasons.csv', mode='r', newline='') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        training_data.append(row)


## Optimal Team : Considerations
For my optimal team, I will aim for:
- At least 1 player in the top 20% for true shooting percentage.
- At least 1 player in the top 5% of the 100-player pool for rebounds.
- At least 1 player with a defensive rebound percentage greater than 0.2.
- At least 3 players in the top 20% for net rating.
- At least 2 players with above-average assists.
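
The considerations above can be read as a checklist applied to a candidate five-player team. The following is a minimal sketch under stated assumptions: `percentile_cutoff` and `meets_considerations` are hypothetical helpers that do not appear in this notebook, "top 20%" is taken to mean "among the best 20% of values in the pool", and each player is a dict with numeric `ts_pct`, `reb`, `dreb_pct`, `rating`, and `ast` values.

```python
from statistics import mean

def percentile_cutoff(values, top_fraction):
    """Smallest value that still belongs to the top `top_fraction` of `values`."""
    ranked = sorted(values, reverse=True)
    count = max(1, round(len(ranked) * top_fraction))
    return ranked[count - 1]

def meets_considerations(team, pool):
    """Check a team (list of player dicts) against the five considerations."""
    ts_cut = percentile_cutoff([p['ts_pct'] for p in pool], 0.20)
    reb_cut = percentile_cutoff([p['reb'] for p in pool], 0.05)
    rating_cut = percentile_cutoff([p['rating'] for p in pool], 0.20)
    ast_avg = mean(p['ast'] for p in pool)
    return (
        sum(p['ts_pct'] >= ts_cut for p in team) >= 1          # shooting
        and any(p['reb'] >= reb_cut for p in team)             # rebounding
        and any(p['dreb_pct'] > 0.2 for p in team)             # defensive boards
        and sum(p['rating'] >= rating_cut for p in team) >= 3  # net rating
        and sum(p['ast'] > ast_avg for p in team) >= 2         # assists
    )
```

A check like this could be used to label or filter randomly sampled teams, although the notebook itself scores teams with a weighted sum instead.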

In [882]:
def extract_features(player):
    return {
        'name': player[1],
        'ts_pct': float(player[19]),
        'reb': float(player[13]),
        'dreb_pct': float(player[17]),
        'rating': float(player[15]),
        'ast': float(player[14])
    }


In [883]:
def team_score(team):
    ts_pct = np.mean([player['ts_pct'] for player in team])
    reb = np.sum([player['reb'] for player in team])
    dreb_pct = np.max([player['dreb_pct'] for player in team])
    rating = np.mean([player['rating'] for player in team])
    ast = np.sum([player['ast'] for player in team])
    # For my optimal team, the most important features are
    # 1. True shooting percentage
    # 2. Defensive rebound percentage
    # 3. Rating
    # 4. Assists/Rebounds
    score = (ts_pct * 20 + reb * 0.5 + dreb_pct * 10 + rating + ast * 0.5) / 5
    return score

def select_optimal_team(pool, team_size=5):
    player_features = [extract_features(player) for player in pool]
    
    best_team = None
    best_score = -float('inf')

    # Use a heuristic to find a good solution
    for _ in range(1000):  # Number of iterations
        team = random.sample(player_features, team_size)
        score = team_score(team)
        if score > best_score:
            best_score = score
            best_team = team

    return best_team, best_score
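
To make the weighting in `team_score` concrete, the sketch below recomputes the same formula in plain Python for a small team. The helper name `team_score_plain` and the five stat lines are illustrative assumptions (hypothetical values, not rows from `all_seasons.csv`); only the formula is taken from the cell above.

```python
from statistics import mean

def team_score_plain(team):
    """Same weighting as team_score, written without NumPy."""
    ts_pct = mean(p['ts_pct'] for p in team)     # average true shooting
    reb = sum(p['reb'] for p in team)            # total rebounds
    dreb_pct = max(p['dreb_pct'] for p in team)  # best defensive rebounder
    rating = mean(p['rating'] for p in team)     # average net rating
    ast = sum(p['ast'] for p in team)            # total assists
    return (ts_pct * 20 + reb * 0.5 + dreb_pct * 10 + rating + ast * 0.5) / 5

# Hypothetical stat lines, chosen only to keep the arithmetic easy to follow.
example_team = [
    {'ts_pct': 0.60, 'reb': 10.0, 'dreb_pct': 0.30, 'rating': 5.0, 'ast': 2.0},
    {'ts_pct': 0.50, 'reb': 4.0, 'dreb_pct': 0.10, 'rating': 1.0, 'ast': 8.0},
    {'ts_pct': 0.55, 'reb': 6.0, 'dreb_pct': 0.20, 'rating': 3.0, 'ast': 4.0},
    {'ts_pct': 0.50, 'reb': 5.0, 'dreb_pct': 0.15, 'rating': -1.0, 'ast': 3.0},
    {'ts_pct': 0.60, 'reb': 5.0, 'dreb_pct': 0.15, 'rating': 2.0, 'ast': 3.0},
]
example_score = team_score_plain(example_team)
```

For this example the weighted terms are 11 (true shooting), 15 (rebounds), 3 (defensive rebound percentage), 2 (rating), and 10 (assists), so the summed statistics carry more weight than the priority order in the code comment suggests.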





## Model

In [884]:
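# Note: this redefinition replaces the per-player extract_features above. It takes
# the whole pool and keeps the values as strings, so select_optimal_team (which
# relies on the per-player version) has to be run before this cell.
def extract_features(pool):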
    features_list = []
    for player in pool:
        # extract relevant features based on considerations above
        features = {
            'name': player[1],
            'ts_pct': player[19],
            'reb': player[13],
            'dreb_pct': player[17],
            'rating': player[15],
            'ast': player[14]
        }
        features_list.append(features)
    return features_list


## Simulate Training Data to train model

In [885]:
def make_empty(player):
    player['ts_pct'] = 0
    player['reb'] = 0
    player['dreb_pct'] = 0
    player['rating'] = 0
    player['ast'] = 0
    return player

## Get Best Player for Category

In [886]:
def get_best(players, category):
    best_index = 0
    for cur in range(len(players)):
        if float(players[cur][category]) > float(players[best_index][category]):
            best_index = cur
    return best_index
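
The same search can be expressed with the built-in `max`. This is an equivalent sketch rather than the notebook's code: `get_best_index` is a hypothetical name, and like `get_best` it returns the first index when several players tie.

```python
def get_best_index(players, category):
    """Index of the player with the largest numeric value for `category`."""
    return max(range(len(players)), key=lambda i: float(players[i][category]))

# Values are strings here because csv.reader yields strings.
players = [
    {'name': 'A', 'reb': '4.0'},
    {'name': 'B', 'reb': '10.3'},
    {'name': 'C', 'reb': '10.3'},
]
best = get_best_index(players, 'reb')
```

One difference to keep in mind: `max` raises `ValueError` on an empty list, whereas `get_best` returns 0.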

## Preprocess the data

In [887]:
def get_datapool(segment):
    # Extract features for every loaded player
    Xall = extract_features(training_data)

    # Take the 100 players that belong to this segment
    X1 = [Xall[i + segment * 100] for i in range(100)]

    # Numeric inputs, with reb, rating, and ast scaled down by 100
    Xval = np.array([
        [
            float(player['ts_pct']),
            float(player['reb']) / 100,
            float(player['dreb_pct']),
            float(player['rating']) / 100,
            float(player['ast']) / 100,
        ]
        for player in X1
    ])

    # Work on copies so that make_empty does not alter X1
    tempX1 = [player.copy() for player in X1]

    # Expected output: 0 for every player, 0.9 for the selected ones
    Y1 = np.zeros(len(X1))

    # Pick the best remaining player in each category, in this order
    for category in ['ts_pct', 'reb', 'dreb_pct', 'rating', 'ast']:
        cur = get_best(tempX1, category)
        tempX1[cur] = make_empty(tempX1[cur])  # prevent repeats
        Y1[cur] = 0.9

    Y1temp = tf.convert_to_tensor(Y1, dtype=tf.float64)
    Xvaltemp = tf.convert_to_tensor(Xval, dtype=tf.float64)

    # Return raw X (X1), processed X (Xval), and Y values (Y1)
    return X1, Xvaltemp, Y1temp

X_raw_train, X_train, y_train = get_datapool(1)
X_raw_test, X_test, y_test = get_datapool(2)
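
The labeling rule inside `get_datapool` amounts to: for each category in turn, mark the best player not yet chosen with 0.9 and leave everyone else at 0. A minimal standalone sketch of that rule follows. `label_pool` is a hypothetical helper, and it skips already-chosen players with a set instead of zeroing their stats through `make_empty`, so it is a simplification rather than the notebook's exact procedure.

```python
CATEGORIES = ['ts_pct', 'reb', 'dreb_pct', 'rating', 'ast']

def label_pool(players, categories=CATEGORIES, positive=0.9):
    """Return one label per player: `positive` for category leaders, else 0.0."""
    labels = [0.0] * len(players)
    chosen = set()
    for category in categories:
        candidates = [i for i in range(len(players)) if i not in chosen]
        if not candidates:
            break  # fewer players than categories
        best = max(candidates, key=lambda i: float(players[i][category]))
        chosen.add(best)
        labels[best] = positive
    return labels

# Toy pool with string values, as produced by the CSV reader.
toy_pool = [
    {'ts_pct': '0.70', 'reb': '1', 'dreb_pct': '0.05', 'rating': '0', 'ast': '1'},
    {'ts_pct': '0.50', 'reb': '15', 'dreb_pct': '0.35', 'rating': '1', 'ast': '1'},
    {'ts_pct': '0.50', 'reb': '8', 'dreb_pct': '0.30', 'rating': '2', 'ast': '2'},
    {'ts_pct': '0.50', 'reb': '3', 'dreb_pct': '0.10', 'rating': '12', 'ast': '3'},
    {'ts_pct': '0.50', 'reb': '3', 'dreb_pct': '0.10', 'rating': '3', 'ast': '9'},
    {'ts_pct': '0.45', 'reb': '2', 'dreb_pct': '0.08', 'rating': '-5', 'ast': '0.5'},
    {'ts_pct': '0.48', 'reb': '2', 'dreb_pct': '0.09', 'rating': '-2', 'ast': '4'},
]
toy_labels = label_pool(toy_pool)
```

Tracking chosen indices also sidesteps an edge case of the zeroing approach: a zeroed player has a rating of 0, which would still win the rating category if every remaining player had a negative rating.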

## Print Player Pools

In [888]:
print("Training Team:")
print(X_raw_train)
print()
print("Testing Team:")
print(X_raw_test)

Training Team:
[{'name': 'Chris Mills', 'ts_pct': '0.544', 'reb': '6.2', 'dreb_pct': '0.157', 'rating': '4.5', 'ast': '2.5'}, {'name': 'Chris Morris', 'ts_pct': '0.486', 'reb': '2.2', 'dreb_pct': '0.155', 'rating': '-7.3', 'ast': '0.6'}, {'name': 'Chris Mullin', 'ts_pct': '0.645', 'reb': '4.0', 'dreb_pct': '0.106', 'rating': '-4.9', 'ast': '4.1'}, {'name': 'Chris Robinson', 'ts_pct': '0.486', 'reb': '1.7', 'dreb_pct': '0.08800000000000001', 'rating': '-11.4', 'ast': '1.6'}, {'name': 'Chris Webber', 'ts_pct': '0.5539999999999999', 'reb': '10.3', 'dreb_pct': '0.207', 'rating': '3.4', 'ast': '4.6'}, {'name': 'Chris Whitney', 'ts_pct': '0.5660000000000001', 'reb': '1.3', 'dreb_pct': '0.09300000000000001', 'rating': '2.0', 'ast': '2.2'}, {'name': 'Christian Laettner', 'ts_pct': '0.562', 'reb': '8.8', 'dreb_pct': '0.18600000000000003', 'rating': '11.0', 'ast': '2.7'}, {'name': 'Chucky Brown', 'ts_pct': '0.552', 'reb': '2.1', 'dreb_pct': '0.177', 'rating': '-5.9', 'ast': '0.4'}, {'name': 'Chr

## Train Model
Now that we have the training data, we can train the model.

In [889]:
# 3. Neural Network Model
model = tf.keras.models.Sequential([
    # No explicit input layer: Keras infers the input size (5 features) from the first batch
    
    # Hidden Layer 1
    tf.keras.layers.Dense(100, activation='relu'),
    
    # Hidden Layer 2
    tf.keras.layers.Dense(100, activation='relu'),
    
    # Output Layer
    tf.keras.layers.Dense(1, activation='sigmoid')  # Binary classification
])

# 4. Compile the Model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

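# 5. Train the Model
# Ten passes over segments 0-9. Segments 1 and 2 are the pools evaluated below
# as training and testing data, so the testing pool is also seen during training.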
for i in range(10):
    for j in range(10):
        _, X_temp, y_temp = get_datapool(j)
        model.fit(X_temp, y_temp, epochs=10, batch_size=32)



Epoch 1/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.9685 - loss: 0.6724
Epoch 2/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9477 - loss: 0.6310 
Epoch 3/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9446 - loss: 0.5922 
Epoch 4/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9383 - loss: 0.5542 
Epoch 5/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9665 - loss: 0.5026 
Epoch 6/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9602 - loss: 0.4627 
Epoch 7/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9508 - loss: 0.4239 
Epoch 8/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9477 - loss: 0.3838 
Epoch 9/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

## Print Team

In [890]:
# Training Data
print("Actual Best Team:")
for i in range(len(y_train)):
    if(y_train[i] > 0):
        print(X_raw_train[i])
print()

print("Predicted Best Team:")
predicted_y = model.predict(X_train)
for i in range(len(predicted_y)):
    if(predicted_y[i][0] > 0.5):
        print(X_raw_train[i]['name'])
        print(predicted_y[i])

Actual Best Team:
{'name': 'Chris Mullin', 'ts_pct': '0.645', 'reb': '4.0', 'dreb_pct': '0.106', 'rating': '-4.9', 'ast': '4.1'}
{'name': 'Bruce Bowen', 'ts_pct': '0.0', 'reb': '0.0', 'dreb_pct': '0.0', 'rating': '300.0', 'ast': '0.0'}
{'name': 'Charles Barkley', 'ts_pct': '0.581', 'reb': '13.5', 'dreb_pct': '0.28', 'rating': '7.5', 'ast': '4.7'}
{'name': 'Damon Stoudamire', 'ts_pct': '0.516', 'reb': '4.1', 'dreb_pct': '0.09', 'rating': '-3.7', 'ast': '8.8'}
{'name': 'Dennis Rodman', 'ts_pct': '0.479', 'reb': '16.1', 'dreb_pct': '0.32299999999999995', 'rating': '16.1', 'ast': '3.1'}

Predicted Best Team:
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
[[2.63088453e-03]
 [1.05120311e-03]
 [1.56185299e-01]
 [1.79899239e-03]
 [1.18748695e-01]
 [4.86175483e-03]
 [1.48781799e-02]
 [6.62057335e-03]
 [2.73287315e-02]
 [1.10956701e-02]
 [5.30578662e-03]
 [2.64622811e-02]
 [2.39031506e-03]
 [2.39231600e-03]
 [7.60078942e-03]
 [2.34725929e-04]
 [9.18812475e-06]
 [1.336350

In [891]:
# Testing Data
print("Actual Best Team:")
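for i in range(len(y_test)):
    # Selected players are labeled 0.9, so test for > 0 as in the training cell;
    # an == 1 test matches no one, as in the empty listing recorded below.
    if y_test[i] > 0:
        print(X_raw_test[i])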
print()

print("Predicted Best Team:")
predicted_y = model.predict(X_test)
for i in range(len(predicted_y)):
    if(predicted_y[i][0] > 0.5):
        print(X_raw_test[i]['name'])
        print(predicted_y[i])


Actual Best Team:

Predicted Best Team:
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
[[6.11448288e-02]
 [5.69390925e-03]
 [1.91106438e-03]
 [3.71022907e-04]
 [1.00414455e-03]
 [7.58377719e-05]
 [6.00838423e-01]
 [1.47173870e-02]
 [8.62057041e-03]
 [4.32272954e-03]
 [8.53415504e-02]
 [3.59940901e-03]
 [2.12807558e-03]
 [2.62164921e-02]
 [1.16805942e-03]
 [1.43430904e-02]
 [1.73663336e-03]
 [1.44869238e-02]
 [1.93834465e-04]
 [2.51902803e-03]
 [1.35771499e-03]
 [3.82914254e-03]
 [2.36534569e-02]
 [1.63383680e-04]
 [2.02395469e-02]
 [3.04049288e-04]
 [6.82141818e-03]
 [4.97091748e-03]
 [1.21310016e-03]
 [8.19710121e-02]
 [1.19028585e-02]
 [1.17859431e-02]
 [2.62695044e-01]
 [3.43471183e-03]
 [1.42924367e-02]
 [1.71454549e-02]
 [5.20686572e-03]
 [6.33347023e-04]
 [8.68176576e-04]
 [1.15555739e-02]
 [1.18415081e-03]
 [1.62428396e-03]
 [4.31825174e-03]
 [6.74946308e-02]
 [3.29318387e-03]
 [3.52297490e-03]
 [7.30480393e-03]
 [5.40148467e-03]
 [1.55567691e-01]
 [3.20

## Explain your architecture and how the basketball player characteristics are used as inputs:

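1. Data Preparation:

- Pool and Training Data:
    - Separates data:
        - `training_data` stores every player row read from `all_seasons.csv`.
        - `pool` is created as an empty list for potential future use and is not filled in this notebook.
        - `get_datapool(segment)` slices `training_data` into pools of 100 players; segment 1 is printed as the training pool and segment 2 as the testing pool.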

2. Feature Extraction:

- The `extract_features` function takes a list of players and creates a dictionary of features for each one. 
- These features include:
    - Name (not used as input)
    - True Shooting Percentage (ts_pct)
    - Rebounds (reb)
    - Defensive Rebound Percentage (dreb_pct)
    - Net Rating (rating)
    - Assists (ast)
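
For reference, the conversion from one feature dictionary to the five-number input vector used in `get_datapool` can be sketched as below. `to_input_vector` is a hypothetical helper written for illustration; the scaling (dividing `reb`, `rating`, and `ast` by 100 and leaving the two percentages unchanged) is copied from that cell.

```python
def to_input_vector(player):
    """Convert one feature dict (string values) into the model's 5 inputs."""
    return [
        float(player['ts_pct']),        # kept as is
        float(player['reb']) / 100,     # rebounds, scaled down
        float(player['dreb_pct']),      # kept as is
        float(player['rating']) / 100,  # net rating, may be negative
        float(player['ast']) / 100,     # assists, scaled down
    ]

# Row taken from the printed training pool above.
chris_mills = {'name': 'Chris Mills', 'ts_pct': '0.544', 'reb': '6.2',
               'dreb_pct': '0.157', 'rating': '4.5', 'ast': '2.5'}
vector = to_input_vector(chris_mills)
```

The name is carried along in the raw dictionaries only so that predictions can be matched back to players; it never enters the vector.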

3. Simulating Data for Training:

- The `get_datapool` function generates labeled training data for one pool of 100 players.
- It takes a segment index (`segment`) and uses players `segment * 100` through `segment * 100 + 99` of `training_data`.
- Here's how player characteristics are used as inputs:
    - Each player becomes a vector of five numbers: `ts_pct`, `reb / 100`, `dreb_pct`, `rating / 100`, and `ast / 100`.
    - Labels are assigned one category at a time, in the order `ts_pct`, `reb`, `dreb_pct`, `rating`, `ast`:
        - `get_best` finds the player with the highest value in the category.
        - That player receives the label 0.9 and is zeroed out with `make_empty` so that they are not picked again.
        - All other players keep the label 0.

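4. Model Building and Training:

- The model is a Keras `Sequential` network with two hidden `Dense` layers of 100 ReLU units each and a single sigmoid output unit.
- It is compiled with the Adam optimizer, binary cross-entropy loss, and accuracy as the reported metric.
- Training loops 10 times over segments 0 through 9, calling `model.fit` on each segment for 10 epochs with a batch size of 32.
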
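
Because `get_datapool` labels selected players with 0.9 rather than 1, the binary cross-entropy used in the training cell is smallest when the predicted score equals 0.9, and that minimum is above zero. The sketch below evaluates the per-example formula `-(y * log(p) + (1 - y) * log(1 - p))` in plain Python. `binary_cross_entropy` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Per-example binary cross-entropy for a target in [0, 1]."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Loss for a labeled player (target 0.9) at several predicted scores.
for p in (0.5, 0.9, 0.99):
    print(f"p={p:.2f}  loss={binary_cross_entropy(0.9, p):.4f}")
```

Under this formula a labeled player contributes a loss of roughly 0.33 even when the model predicts exactly 0.9, and pushing the prediction closer to 1 increases the loss again.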
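5. Evaluation:

- `model.predict` returns one score between 0 and 1 per player; players scoring above 0.5 are printed as the predicted team.
- The predicted team is compared with the labeled team for segment 1 (training pool) and segment 2 (testing pool).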


## Interpret the output of your MLP in the context of selecting an optimal basketball team:

The MLP is trained to predict, from a player's input features (true shooting percentage, rebounds, defensive rebound percentage, net rating, and assists), whether that player belongs on the optimal team. These inputs are key indicators of player performance. The output is a score between 0 and 1 for each player: scores near 1 mark a player as part of the optimal team, and scores near 0 mark the player as left out. In this way the model learns from historical data which statistical profiles were selected, and its scores give insight into which lineups are likely to perform well.

These scores can then inform the choice of an optimal basketball team: the players scoring above the 0.5 threshold form the predicted lineup, which should be statistically balanced across the key performance metrics. Coaches or team selectors could try different player pools to fine-tune their strategy and make sure the team is strong in many aspects of the game, such as scoring, defense, and teamwork.

Accuracy, precision, and recall all indicate how reliable the model is. Accuracy is the only metric reported during training, and it should be read with care: only 5 of every 100 players are labeled as team members, so a model that selects no one already reaches about 95% accuracy. Precision (how many selected players truly belong to the labeled team) and recall (how many labeled team members the model finds) are therefore more informative. In the end, the MLP provides a data-driven method of team selection intended to improve the chance that the chosen lineup performs well on the court.
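
To illustrate why accuracy alone can mislead on this task, the sketch below computes accuracy, precision, and recall for a pool of 100 players with 5 labeled team members. `classification_metrics` is a hypothetical helper, and both prediction vectors are made up for illustration; they are not outputs of the trained model.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly selected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # wrongly selected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed team members
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly left out
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1] * 5 + [0] * 95              # 5 labeled team members out of 100
all_zero = [0] * 100                     # selects no one
three_hits = [1, 1, 1, 0, 0, 1] + [0] * 94  # 3 correct picks, 1 wrong pick

print(classification_metrics(y_true, all_zero))
print(classification_metrics(y_true, three_hits))
```

The all-zero prediction scores 0.95 on accuracy with precision and recall both at 0, while the second prediction raises accuracy only slightly (0.97) yet has precision 0.75 and recall 0.6.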