Gabriel Marcelino, Grant Burk, and Eli Kaustinen
September 2024
Artificial Neural Network (ANN)

## Import dependencies and load data

In [127]:
import sys
print(sys.path)


['/Users/gabriel/Desktop/FALL2024/CST-435/Code/ANN-basketball', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python311.zip', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload', '', '/Users/gabriel/Library/Python/3.11/lib/python/site-packages', '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages']


In [128]:
import csv
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

pool = []
training_data = []
# create pool with players from 2019-2022
with open('all_seasons.csv', mode = 'r') as file:
    csvFile = csv.reader(file)
    # ignore first line
    next(csvFile)
    for lines in csvFile:
        training_data.append(lines)


## Optimal Team : Considerations
For my optimal team, I will aim for:
- 1 or more players in top 20% of shooting percentage.
- Player in the top 5% out of the 100 for rebounds.
- Player with a Defensive Rebound percentage bigger than 0.2
- 3 or more players in top 20% of best net rating
- 2 or more players with better than average assists

In [129]:
def extract_features(player):
    try:
        return {
            'name': player[1],
            'ts_pct': float(player[19]),
            'reb': float(player[13]),
            'dreb_pct': float(player[17]),
            'rating': float(player[15]),
            'ast': float(player[14])
        }
    except IndexError as e:
        print(f"IndexError: {e}")
        print(f"Player data: {player}")
        raise


In [130]:
def team_score(team):
    ts_pct = np.mean([player['ts_pct'] for player in team])
    reb = np.sum([player['reb'] for player in team])
    dreb_pct = np.max([player['dreb_pct'] for player in team])
    rating = np.mean([player['rating'] for player in team])
    ast = np.sum([player['ast'] for player in team])
    # For my optimal team, the most important features are
    # 1. True shooting percentage
    # 2. Defensive rebound percentage
    # 3. Rating
    # 4. Assists/Rebounds
    score = (ts_pct * 20 + reb * 0.5 + dreb_pct * 10 + rating + ast * 0.5) / 5
    return score

def select_optimal_team(pool, team_size=5):
    player_features = [extract_features(player) for player in pool]
    
    best_team = None
    best_score = -float('inf')

    # Use a heuristic to find a good solution
    for _ in range(1000):  # Number of iterations
        team = random.sample(player_features, team_size)
        score = team_score(team)
        if score > best_score:
            best_score = score
            best_team = team

    return best_team, best_score





## Model

In [131]:
# Function to extract features from a player
def extract_features(player):
    try:
        return {
            'name': player[1],
            'ts_pct': float(player[19]),
            'reb': float(player[13]),
            'dreb_pct': float(player[17]),
            'rating': float(player[15]),
            'ast': float(player[14])
        }
    except IndexError as e:
        print(f"IndexError: {e}")
        print(f"Player data: {player}")
        raise


## Simulate Training Data to train model

In [132]:
# Generate multiple pools of 100 players each and find the optimal team for each pool
num_pools = int(training_data[-1][21][:4]) - int(training_data[0][21][:4])
sample_size = 100
all_teams = []
all_labels = []

for _ in range(num_pools):
    pool_sample = random.sample(training_data, sample_size)
    optimal_team, _ = select_optimal_team(pool_sample)
    if optimal_team is not None:
        team_features = [player['ts_pct'] for player in optimal_team] + \
                        [player['reb'] for player in optimal_team] + \
                        [player['dreb_pct'] for player in optimal_team] + \
                        [player['rating'] for player in optimal_team] + \
                        [player['ast'] for player in optimal_team]
        all_teams.append(team_features)
        all_labels.append(1)  # Label for optimal team





## Train Model
Now that we have the training data, we can train the model.

In [133]:
# Convert to numpy arrays
X = np.array(all_teams)
y = np.array(all_labels)

# Normalize the features
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model
# Define the neural network architecture
# Define the neural network architecture
input_size = X_train.shape[1]
hidden_size = 64
output_size = 1
learning_rate = 0.01
epochs = 1000

# Initialize weights and biases
np.random.seed(42)
W1 = np.random.randn(input_size, hidden_size) * 0.01
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * 0.01
b2 = np.zeros((1, output_size))

# Activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# Forward propagation
def forward_propagation(X):
    Z1 = np.dot(X, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

# Compute the cost
def compute_cost(A2, y):
    m = y.shape[0]
    cost = -np.sum(y * np.log(A2) + (1 - y) * np.log(1 - A2)) / m
    return cost

# Backpropagation with gradient clipping
def backward_propagation(X, y, Z1, A1, Z2, A2):
    m = y.shape[0]
    dZ2 = A2 - y.reshape(-1, 1)
    dW2 = np.dot(A1.T, dZ2) / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m
    dA1 = np.dot(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(A1)
    dW1 = np.dot(X.T, dZ1) / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m

    # Gradient clipping
    clip_value = 1.0
    dW1 = np.clip(dW1, -clip_value, clip_value)
    db1 = np.clip(db1, -clip_value, clip_value)
    dW2 = np.clip(dW2, -clip_value, clip_value)
    db2 = np.clip(db2, -clip_value, clip_value)

    return dW1, db1, dW2, db2

# Update weights and biases
def update_parameters(dW1, db1, dW2, db2):
    global W1, b1, W2, b2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Train the model
for epoch in range(epochs):
    Z1, A1, Z2, A2 = forward_propagation(X_train)
    cost = compute_cost(A2, y_train)
    dW1, db1, dW2, db2 = backward_propagation(X_train, y_train, Z1, A1, Z2, A2)
    update_parameters(dW1, db1, dW2, db2)
    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Cost: {cost}')

# Evaluate the model
_, _, _, A2_test = forward_propagation(X_test)
y_pred = (A2_test > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Epoch 0, Cost: 13.25374619713537
Epoch 100, Cost: 1.24291718759942
Epoch 200, Cost: 0.6150262170641626
Epoch 300, Cost: 0.40610025225946533
Epoch 400, Cost: 0.30251392234842756
Epoch 500, Cost: 0.24080479505645186
Epoch 600, Cost: 0.19990019197402328
Epoch 700, Cost: 0.17081762035089101
Epoch 800, Cost: 0.1490888260440433
Epoch 900, Cost: 0.1322427658192825
Accuracy: 1.0



## Model Training and Evaluation


In this section, we will train and evaluate our model using the training and testing datasets. The model is a simple neural network with one hidden layer. We will use forward propagation to compute the activations, backpropagation to compute the gradients, and gradient descent to update the weights and biases. The current accuracy is too high, which means that the model is overfitting, we are currently still looking into how to reduce overfitting. 



### Training Process

1. **Forward Propagation**: Compute the activations for the hidden layer and the output layer.
2. **Compute Cost**: Calculate the cost function to measure the performance of the model.
3. **Backpropagation**: Compute the gradients of the cost function with respect to the weights and biases.
4. **Update Parameters**: Update the weights and biases using gradient descent.
5. **Evaluate Model**: Calculate the accuracy of the model on the testing set.

The model is trained for 1000 epochs, and the cost is printed every 100 epochs. The final accuracy of the model on the testing set is also printed.

This process helps in selecting an optimal basketball team by evaluating different combinations of players based on their performance metrics.



## Interpret the output of your MLP in the context of selecting an optimal basketball team:

### Cost Function
The cost function measures how well the model's predictions match the actual labels. A lower cost indicates better performance. The significant reduction in cost over epochs shows that the model is effectively learning to predict the optimal team.

### Accuracy
Achieving an accuracy of 1.0 means the model perfectly predicts whether a given team is optimal based on the features provided. This suggests that the model has learned the patterns and relationships in the data very well.

### Potential Concerns
**Overfitting**: Perfect accuracy might indicate overfitting, where the model performs exceptionally well on the training and test data but may not generalize to new, unseen data. This can be mitigated by using techniques like cross-validation, adding noise, and regularization (which has already been added).

### Conclusion
The output shows that the MLP model has effectively learned to predict the optimal basketball team based on the provided features. The significant reduction in cost and perfect accuracy indicate that the model has captured the underlying patterns in the data. However, it's essential to ensure that the model generalizes well to new data to avoid overfitting.

