Gabriel Marcelino, Grant Burk, and Eli Kaustinen
September 2024
Artificial Neural Network (ANN)

## Import dependencies and load data

In [141]:
import sys
print(sys.path)


['C:\\Program Files\\JetBrains\\PyCharm 2023.3.3\\plugins\\python\\helpers-pro\\jupyter_debug', 'C:\\Program Files\\JetBrains\\PyCharm 2023.3.3\\plugins\\python\\helpers\\pydev', 'C:\\Users\\grant\\PycharmProjects\\neural-networks', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\DLLs', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\Lib', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312', '', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\win32', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\win32\\lib', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\Pythonwin', 'C:\\Users\\grant\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\setuptools\\_vendor']


In [142]:
import csv
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

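pool = []           # reserved for a candidate pool; not populated in this notebook
training_data = []  # every player-season row from the CSV, as lists of strings
# Load all rows from all_seasons.csv (no season filter is applied here)
with open('all_seasons.csv', mode='r', newline='') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        training_data.append(row)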


## Optimal Team : Considerations
For my optimal team, I will aim for:
- At least one player in the top 20% for true shooting percentage.
- At least one player in the top 5% of the 100-player pool for rebounds.
- At least one player with a defensive rebound percentage greater than 0.2.
- Three or more players in the top 20% for net rating.
- Two or more players with above-average assists.
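
The considerations above can be turned into a direct check on a candidate team. The following is a minimal sketch, not the notebook's method: `top_fraction_threshold` and `meets_criteria` are hypothetical helper names, players are assumed to be feature dictionaries with numeric values, and "top 20%" is read as "at or above the value of the player ranked at the 20% mark".

```python
def top_fraction_threshold(values, fraction):
    """Smallest value that still falls in the top `fraction` of `values`."""
    ranked = sorted(values, reverse=True)
    cutoff_index = max(1, int(len(ranked) * fraction)) - 1
    return ranked[cutoff_index]


def meets_criteria(team, pool):
    """Return True if `team` satisfies all five considerations relative to `pool`."""
    ts_cut = top_fraction_threshold([p['ts_pct'] for p in pool], 0.20)
    reb_cut = top_fraction_threshold([p['reb'] for p in pool], 0.05)
    rating_cut = top_fraction_threshold([p['rating'] for p in pool], 0.20)
    avg_ast = sum(p['ast'] for p in pool) / len(pool)

    return (
        sum(p['ts_pct'] >= ts_cut for p in team) >= 1          # shooting
        and any(p['reb'] >= reb_cut for p in team)             # rebounds
        and any(p['dreb_pct'] > 0.2 for p in team)             # defensive rebounds
        and sum(p['rating'] >= rating_cut for p in team) >= 3  # net rating
        and sum(p['ast'] > avg_ast for p in team) >= 2         # assists
    )


# Synthetic pool for illustration only (not data from all_seasons.csv)
demo_pool = [
    {'ts_pct': i / 100, 'reb': i / 10, 'dreb_pct': i / 400,
     'rating': i - 50, 'ast': i / 20}
    for i in range(100)
]
print(meets_criteria(demo_pool[95:], demo_pool))  # five strongest synthetic players
print(meets_criteria(demo_pool[:5], demo_pool))   # five weakest synthetic players
```

A check like this could serve as a label for whole teams, whereas `team_score` below folds the same statistics into one weighted number.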

In [143]:
def extract_features(player):
    return {
        'name': player[1],
        'ts_pct': float(player[19]),
        'reb': float(player[13]),
        'dreb_pct': float(player[17]),
        'rating': float(player[15]),
        'ast': float(player[14])
    }


In [144]:
def team_score(team):
    ts_pct = np.mean([player['ts_pct'] for player in team])
    reb = np.sum([player['reb'] for player in team])
    dreb_pct = np.max([player['dreb_pct'] for player in team])
    rating = np.mean([player['rating'] for player in team])
    ast = np.sum([player['ast'] for player in team])
    # For my optimal team, the most important features are
    # 1. True shooting percentage
    # 2. Defensive rebound percentage
    # 3. Rating
    # 4. Assists/Rebounds
    score = (ts_pct * 20 + reb * 0.5 + dreb_pct * 10 + rating + ast * 0.5) / 5
    return score

def select_optimal_team(pool, team_size=5):
    player_features = [extract_features(player) for player in pool]
    
    best_team = None
    best_score = -float('inf')

    # Use a heuristic to find a good solution
    for _ in range(1000):  # Number of iterations
        team = random.sample(player_features, team_size)
        score = team_score(team)
        if score > best_score:
            best_score = score
            best_team = team

    return best_team, best_score
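
To put the 1000-iteration random search in perspective, the short calculation below counts how many distinct five-player teams exist. It is only a sketch: the pool size of 100 is an assumption taken from the considerations above, and the variable names are illustrative.

```python
import math

pool_size = 100     # assumed pool size, taken from the considerations above
team_size = 5
iterations = 1000   # number of random teams tried by select_optimal_team

total_teams = math.comb(pool_size, team_size)
coverage = iterations / total_teams

print(total_teams)  # 75287520
print(coverage)     # roughly 1.3e-05
```

Under these assumptions the search looks at only a tiny fraction of the possible teams, so the result is a good team found by chance rather than a guaranteed optimum.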





## Model

In [145]:
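def extract_features(pool):
    # NOTE: this redefinition shadows the single-player extract_features above,
    # so select_optimal_team must be updated before it is called after this cell.
    features_list = []
    for player in pool:
        # Extract the features named in the considerations above.
        # Values stay as raw CSV strings; convert with float() before any math.
        features = {
            'name': player[1],
            'ts_pct': player[19],
            'reb': player[13],
            'dreb_pct': player[17],
            'rating': player[15],
            'ast': player[14]
        }
        features_list.append(features)
    return features_list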


## Simulate Training Data to train model

In [146]:
def make_empty(player):
    player['ts_pct'] = 0
    player['reb'] = 0
    player['dreb_pct'] = 0
    player['rating'] = 0
    player['ast'] = 0
    return player

## Get Best Player for Category

In [147]:
def get_best(players, category):
    best_index = 0
    for cur in range(len(players)):
        if float(players[cur][category]) > float(players[best_index][category]):
            best_index = cur
    return best_index

## Preprocess the data

In [148]:
def get_datapool(segment):
    # Make testing data sets
    Xall = extract_features(training_data)
    
    # Create one training set
    X1 = []
    for i in range(100):
        X1.append(Xall[i + segment * 100])
    
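    # Copy each player dict so that make_empty() below does not also zero out X1
    # (list.copy() is shallow, so both lists would share the same dicts)
    tempX1 = [dict(player) for player in X1]
    
    # Expected labels: 1 for a selected player, 0 otherwise
    Y1 = np.zeros(len(X1), dtype=int)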
    
    # Best ts_pct into Y1
    cur = get_best(tempX1, 'ts_pct')
    tempX1[cur] = make_empty(tempX1[cur]) # Prevent repeat
    Y1[cur] = 1
    
    # Best reb into Y1
    cur = get_best(tempX1, 'reb')
    tempX1[cur] = make_empty(tempX1[cur]) # Prevent repeat
    Y1[cur] = 1
    
    # Best dreb_pct into Y1
    cur = get_best(tempX1, 'dreb_pct')
    tempX1[cur] = make_empty(tempX1[cur]) # Prevent repeat
    Y1[cur] = 1
    
    # Best rating into Y1
    cur = get_best(tempX1, 'rating')
    tempX1[cur] = make_empty(tempX1[cur]) # Prevent repeat
    Y1[cur] = 1
    
    # Best ast into Y1
    cur = get_best(tempX1, 'ast')
    tempX1[cur] = make_empty(tempX1[cur]) # Prevent repeat
    Y1[cur] = 1
    
    print()
    print("tempX1")
    print(len(tempX1))
    print(tempX1)
    
    return X1, Y1

X2, Y2 = get_datapool(1)

print("X2")
print(len(X2))
print(X2)
print()
print("Y2")
print(len(Y2))
print(Y2)

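X_train, y_train = get_datapool(1)
X_test, y_test = get_datapool(2)

# np.array() over a list of dicts yields a 1-D object array, so X.shape[1] fails.
# Build a 2-D float matrix instead: one row per player, one column per feature.
feature_keys = ['ts_pct', 'reb', 'dreb_pct', 'rating', 'ast']

def to_matrix(players):
    return np.array([[float(p[key]) for key in feature_keys] for p in players])

X_train = to_matrix(X_train)
y_train = np.array(y_train, dtype=float)
X_test = to_matrix(X_test)
y_test = np.array(y_test, dtype=float)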


tempX1
100
[{'name': 'Chris Mills', 'ts_pct': '0.544', 'reb': '6.2', 'dreb_pct': '0.157', 'rating': '4.5', 'ast': '2.5'}, {'name': 'Chris Morris', 'ts_pct': '0.486', 'reb': '2.2', 'dreb_pct': '0.155', 'rating': '-7.3', 'ast': '0.6'}, {'name': 'Chris Mullin', 'ts_pct': 0, 'reb': 0, 'dreb_pct': 0, 'rating': 0, 'ast': 0}, {'name': 'Chris Robinson', 'ts_pct': '0.486', 'reb': '1.7', 'dreb_pct': '0.08800000000000001', 'rating': '-11.4', 'ast': '1.6'}, {'name': 'Chris Webber', 'ts_pct': '0.5539999999999999', 'reb': '10.3', 'dreb_pct': '0.207', 'rating': '3.4', 'ast': '4.6'}, {'name': 'Chris Whitney', 'ts_pct': '0.5660000000000001', 'reb': '1.3', 'dreb_pct': '0.09300000000000001', 'rating': '2.0', 'ast': '2.2'}, {'name': 'Christian Laettner', 'ts_pct': '0.562', 'reb': '8.8', 'dreb_pct': '0.18600000000000003', 'rating': '11.0', 'ast': '2.7'}, {'name': 'Chucky Brown', 'ts_pct': '0.552', 'reb': '2.1', 'dreb_pct': '0.177', 'rating': '-5.9', 'ast': '0.4'}, {'name': 'Chris Gatling', 'ts_pct': '0.58

## Train Model
Now that we have the training data, we can train the model.

In [149]:
# Build the model
# Define the neural network architecture: one hidden layer, one sigmoid output
input_size = X_train.shape[1]
hidden_size = 64
output_size = 1
learning_rate = 0.01
epochs = 1000

# Initialize weights and biases
np.random.seed(42)
W1 = np.random.randn(input_size, hidden_size) * 0.01
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * 0.01
b2 = np.zeros((1, output_size))

# Activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # Expects the sigmoid output a = sigmoid(z), not the pre-activation z
    return a * (1 - a)

# Forward propagation
def forward_propagation(X):
    Z1 = np.dot(X, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

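# Compute the binary cross-entropy cost
def compute_cost(A2, y):
    m = y.shape[0]
    y = y.reshape(-1, 1)  # match A2's (m, 1) shape; a 1-D y would broadcast to (m, m)
    A2 = np.clip(A2, 1e-12, 1 - 1e-12)  # avoid log(0)
    cost = -np.sum(y * np.log(A2) + (1 - y) * np.log(1 - A2)) / m
    return cost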

# Backpropagation with gradient clipping
def backward_propagation(X, y, Z1, A1, Z2, A2):
    m = y.shape[0]
    dZ2 = A2 - y.reshape(-1, 1)
    dW2 = np.dot(A1.T, dZ2) / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(A1)
    dW1 = np.dot(X.T, dZ1) / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m

    # Gradient clipping
    clip_value = 1.0
    dW1 = np.clip(dW1, -clip_value, clip_value)
    db1 = np.clip(db1, -clip_value, clip_value)
    dW2 = np.clip(dW2, -clip_value, clip_value)
    db2 = np.clip(db2, -clip_value, clip_value)

    return dW1, db1, dW2, db2

# Update weights and biases
def update_parameters(dW1, db1, dW2, db2):
    global W1, b1, W2, b2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Train the model
for epoch in range(epochs):
    Z1, A1, Z2, A2 = forward_propagation(X_train)
    cost = compute_cost(A2, y_train)
    dW1, db1, dW2, db2 = backward_propagation(X_train, y_train, Z1, A1, Z2, A2)
    update_parameters(dW1, db1, dW2, db2)
    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Cost: {cost}')

# Evaluate the model
_, _, _, A2_test = forward_propagation(X_test)
y_pred = (A2_test > 0.5).astype(int).ravel()  # flatten (m, 1) to (m,) to match y_test
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

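IndexError: tuple index out of range

In the recorded run, this error was raised at `input_size = X_train.shape[1]`, because `X_train` was built by calling `np.array` on a list of feature dictionaries, which gives a 1-D array whose `shape` has no second entry.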
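
The `backward_propagation` function uses `dZ2 = A2 - y` without showing where it comes from. As a short check, consider a single example with label $y$, pre-activation $z$, and output $a = \sigma(z)$. These symbols are chosen here for illustration and correspond to `Z2` and `A2` in the code.

$$
\begin{aligned}
L &= -\bigl[\, y \log a + (1 - y) \log (1 - a) \,\bigr] \\
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a} \\
\frac{\partial a}{\partial z} &= a (1 - a) \\
\frac{\partial L}{\partial z} &= \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) a (1 - a) = -y (1 - a) + (1 - y)\, a = a - y
\end{aligned}
$$

The factor $a(1 - a)$ is the same expression that `sigmoid_derivative` applies to `A1` when the error is passed back to the hidden layer.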

## Explain your architecture and how the basketball player characteristics are used as inputs:

1. Data Preparation:

- Pool and Training Data:
    - `pool` is initialized as an empty list for potential future use; it is not populated in this notebook.
    - `training_data` stores every row of `all_seasons.csv` (header skipped); `get_datapool` slices it into 100-player segments for training and testing.

2. Feature Extraction:

- The `extract_features` function takes a list of players and creates a dictionary of features for each one. 
- These features include:
    - Name (not used as input)
    - True Shooting Percentage (ts_pct)
    - Rebounds (reb)
    - Defensive Rebound Percentage (dreb_pct)
    - Net Rating (rating)
    - Assists (ast)

3. Simulating Data for Training:

- The `get_datapool` function generates one labeled data set per 100-player segment of `training_data`.
- It takes a segment index (`segment`) and uses players `segment * 100` through `segment * 100 + 99`.
- Here's how player characteristics are used to build the labels:
    - The features of all 100 players are extracted using `extract_features`.
    - For each of the five categories in turn (shooting percentage, rebounds, defensive rebound percentage, net rating, assists):
        - `get_best` finds the player with the highest value in that category.
        - That player is labeled 1 and blanked out with `make_empty`, so the same player cannot be picked twice.
    - The remaining 95 players are labeled 0.

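4. Model Building and Training:

- The model is a multilayer perceptron written directly in NumPy, with one hidden layer of 64 sigmoid units and a single sigmoid output unit.
- Weights are drawn from a normal distribution scaled by 0.01 (seed 42), and biases start at zero.
- Training runs 1000 epochs of full-batch gradient descent with a learning rate of 0.01, minimizing binary cross-entropy. Gradients are clipped to [-1, 1], and the cost is printed every 100 epochs.

5. Evaluation:

- The trained network is run on a second 100-player segment (`get_datapool(2)`). Outputs above 0.5 are treated as class 1, and `accuracy_score` compares them with the expected labels.
- No accuracy is reported, because the recorded run stopped with an `IndexError` at the start of the training cell.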


## Interpret the output of your MLP in the context of selecting an optimal basketball team:

This MLP is trained to predict whether a basketball lineup is optimal, given input features from the dataset such as shooting percentage, rebounds, assists, and player ratings. These inputs are key indicators of player performance. The model maps them to a binary outcome: 1 stands for an optimal team and 0 for a suboptimal one. In other words, the MLP learns from historical data whether a combination of players is strong or weak, and so provides insight into which lineups are likely to perform well.

These predictions can then inform the choice of an optimal basketball team, since the model shows which sets of players are statistically balanced across key performance metrics. A prediction of "optimal" indicates that the set of players is effective, while teams labeled "suboptimal" need adjustment. Coaches or team selectors can try different mixes of players to fine-tune their strategy and ensure the team is strong in many aspects of the game, such as scoring, defense, and teamwork.

Besides accuracy, precision and recall indicate how reliable the model is. High accuracy means the model discriminates well between good and bad lineups. High precision means few lineups are wrongly predicted as optimal, and high recall means few optimal lineups are missed. In the end, the MLP provides a data-driven method of team selection, intended to make it more likely that the chosen lineup performs well on the court.
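
To make the metrics above concrete, the sketch below computes precision and recall from binary labels in plain Python. The helper name `precision_recall` is hypothetical, and the label lists are made up for illustration; they are not results from this notebook.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 means 'optimal'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')
```

With the labels used here, where most examples are 0, accuracy alone can look high even when the positive class is predicted poorly, which is why precision and recall are worth reporting alongside it. The same values can be obtained from `precision_score` and `recall_score` in `sklearn.metrics`.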