Gabriel Marcelino
September 2024
Artificial Neural Network (ANN)

## Import dependencies and load data

In [138]:
import sys
print(sys.path)


['/Users/gabriel/Desktop/FALL2024/CST-435/Code/ANN-basketball', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python311.zip', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload', '', '/Users/gabriel/Library/Python/3.11/lib/python/site-packages', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages']


In [139]:
import csv
import random
import numpy as np
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

pool = []
training_data = []
# create pool with players from 2019-2022
with open('all_seasons.csv', mode = 'r') as file:
    csvFile = csv.reader(file)
    # ignore first line
    next(csvFile)
    for lines in csvFile:
        year = int(lines[21][:4])
        if 2018 < year < 2023 and len(pool) < 100 and lines[1] not in pool:
            pool.append(lines)
        elif len(training_data) < 5000:
            training_data.append(lines)


## Optimal Team : Considerations
For my optimal team, I will aim for:
- 1 or more players in top 20% of shooting percentage.
- Player in the top 5% out of the 100 for rebounds.
- Player with a Defensive Rebound percentage bigger than 0.2
- 3 or more players in top 20% of best net rating
- 2 or more players with better than average assists

In [140]:
def is_optimal_team(features, average_assists, top_20_ts, top_10_reb, top_20_rating):
        label = 0
    # check if there are 2 players with better than average assists
        players_ast = [player for player in features if float(player['ast']) > average_assists]
        if len(players_ast) >= 3:
            # check if 1 or more players in top 20% of shooting percentage
            players_ts = [player for player in features if float(player['ts_pct']) > top_20_ts]
            if len(players_ts) >= 1:
                # check if any player is in the top 10% for rebounds
                players_reb = [player for player in features if float(player['reb']) > 1]
                if len(players_reb) >=5:
                    # check if any player on team has dreb pct > 0.2
                    players_dreb = [player for player in features if float(player['dreb_pct'])>0.2]
                    if len(players_dreb) >=2:
                        # check if 3 or more players in top 20% of net rating
                        players_rating = [player for player in features if float(player['rating']) > top_20_rating]
                        if len(players_rating) >= 3:
                            # Optimal Team Found
                            label = 1
        return label



## Model

In [141]:
def extract_features(pool):
    features_list = []
    for player in pool:
        # extract relevant features based on considerations above
        features = {
            'name': player[1],
            'ts_pct': player[19],
            'reb': player[13],
            'dreb_pct': player[17],
            'rating': player[15],
            'ast': player[14]
        }
        features_list.append(features)
    return features_list


## Simulate Training Data to train model

In [142]:
def simulate_data(sample, num_iter=10000):
    X = []
    y = []
    # calculate average assists
    assists = np.array([float(player[14]) for player in sample])
    average_assists = np.mean(assists)
    # calculate top 20% of shooting percentage
    top_20_ts = np.percentile([float(player[19]) for player in sample], 80)
    # calculate top 10% of rebound
    top_10_reb = np.percentile([float(player[13]) for player in sample], 90)
    # calculate top 20% net rating
    top_20_rating = np.percentile([float(player[15]) for player in sample], 80)

    for i in range(num_iter):
        # select 5 random players from list
        selected_players = random.sample(sample, 5)
        features = extract_features(selected_players)
        X.append(features)
        """
        - 1 or more players in top 20% of shooting percentage.
        - Player in the top 10% for rebounds.
        - Player with a Defensive Rebound percentage bigger than 0.2
        - 3 or more players in top 20% of best net rating
        - 2 or more players with better than average assists
        """
        label = is_optimal_team(features, average_assists, top_20_ts, top_10_reb, top_20_rating)
        
        y.append(label)                   
    X = np.array(X)
    y = np.array(y)
    
    # output shapes
    print(f"Shape of X: {X.shape}")
    print(f"Shape of y: {y.shape}")
    
    # Output number of y = 1 occurances and output
    count_ones = np.sum(y == 1)
    print(f"Number of 1's in y: {count_ones}")

    return X, y




## Train Model
Now that we have the training data, we can train the model.

In [143]:
X_train, y_train = simulate_data(training_data)

# Extract numerical features from dictionaries
X_train = np.array([
    [
        float(player['ts_pct']),
        float(player['reb']),
        float(player['dreb_pct']),
        float(player['rating']),
        float(player['ast'])
    ]
    for team in X_train
    for player in team
]).reshape(len(X_train), -1)
                    
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

# Build the model
# Define the neural network architecture
# Define the neural network architecture
input_size = X_train.shape[1]
hidden_size = 64
output_size = 1
learning_rate = 0.01
epochs = 1000

# Initialize weights and biases
np.random.seed(42)
W1 = np.random.randn(input_size, hidden_size) * 0.01
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * 0.01
b2 = np.zeros((1, output_size))

# Activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# Forward propagation
def forward_propagation(X):
    Z1 = np.dot(X, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

# Compute the cost
def compute_cost(A2, y):
    m = y.shape[0]
    cost = -np.sum(y * np.log(A2) + (1 - y) * np.log(1 - A2)) / m
    return cost

# Backpropagation with gradient clipping
def backward_propagation(X, y, Z1, A1, Z2, A2):
    m = y.shape[0]
    dZ2 = A2 - y.reshape(-1, 1)
    dW2 = np.dot(A1.T, dZ2) / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(A1)
    dW1 = np.dot(X.T, dZ1) / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m

    # Gradient clipping
    clip_value = 1.0
    dW1 = np.clip(dW1, -clip_value, clip_value)
    db1 = np.clip(db1, -clip_value, clip_value)
    dW2 = np.clip(dW2, -clip_value, clip_value)
    db2 = np.clip(db2, -clip_value, clip_value)

    return dW1, db1, dW2, db2

# Update weights and biases
def update_parameters(dW1, db1, dW2, db2):
    global W1, b1, W2, b2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Train the model
for epoch in range(epochs):
    Z1, A1, Z2, A2 = forward_propagation(X_train)
    cost = compute_cost(A2, y_train)
    dW1, db1, dW2, db2 = backward_propagation(X_train, y_train, Z1, A1, Z2, A2)
    update_parameters(dW1, db1, dW2, db2)
    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Cost: {cost}')

# Evaluate the model
_, _, _, A2_test = forward_propagation(X_test)
y_pred = (A2_test > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Shape of X: (10000, 5)
Shape of y: (10000,)
Number of 1's in y: 28
Epoch 0, Cost: 3626.926484282593
Epoch 100, Cost: 328.27650163102925
Epoch 200, Cost: 190.81517448884736
Epoch 300, Cost: 151.98975517929458


## Explain your architecture and how the basketball player characteristics are used as inputs:

1. Data Preparation:

- Pool and Training Data:
    - Separates data: 
        - `pool` holds 100 unique players (2019-2022) for potential future use.
        - `training_data` stores up to 5000 players (2019-2022) for model training.

2. Feature Extraction:

- The `extract_features` function takes a list of players and creates a dictionary of features for each one. 
- These features include:
    - Name (not used as input)
    - True Shooting Percentage (ts_pct)
    - Rebounds (reb)
    - Defensive Rebound Percentage (dreb_pct)
    - Net Rating (rating)
    - Assists (ast)

3. Simulating Data for Training:

- The `simulate_data` function generates training data with labels indicating successful teams based on pre-defined criteria.
- It takes a list of players (`sample`) and a number of simulations (`num_iter`).
- Here's how player characteristics are used as inputs:
    - Averages and percentiles are calculated for assists, shooting percentage, rebounds, and net rating from the `sample` data.
    - Loops through `num_iter` simulations:
        - Selects 5 random players.
        - Extracts their features using `extract_features`.
        - Assigns a label (0 or 1) based on the criteria defined above..

4. Model Building and Training:

TO-DO
5. Evaluation:

TO-DO


## Interpret the output of your MLP in the context of selecting an optimal basketball team:

This MLP is trained on the 'optimal' prediction, given the input features of the dataset-shooting percentage, rebounds, assists, and player ratings amongst others-a basketball lineup is optimal. These mentioned inputs provide key indications about player performance. The model will process it for a binary outcome; 1 stands for an optimum team, and 0 stands for a suboptimum team. It could be shown that the MLP model is currently trained to learn from historical data to understand whether a combination of players is strong or weak and thus provide insight into which lineups will perform likely perform.

These could then inform the choice of an optimum basketball team, where the model shows which sets of players are statistically balanced in key performance metrics. The teams that are predicted as "optimal" will indicate that their set of players is effective, while those labeled "suboptimal" need adjustment. Coaches or team selectors would be able to play around with different mixes of players in order to fine-tune their strategy and ensure the team is great in a lot of aspects of the game: scoring, defense, teamwork, just to name a few.

Besides the precision, recall is one of the metrics that signifies model reliability or accuracy. High accuracy means the model can discriminate between good and bad lineups well. While precision and recall give an indication of few false predictions or missing optimal teams, in the end, what an MLP model provides is a data-driven method of selection of the team so as to best ensure that the lineup chosen will perform better on the court.