Gabriel Marcelino
September 2024
Artificial Neural Network (ANN)

## Import dependencies and load data

In [12]:
import csv
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

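pool = []
pool_names = set()  # names already in the pool, used to keep players unique
training_data = []
# create pool of up to 100 unique players from the 2019-2022 seasons;
# rows that do not go into the pool become training data (up to 5000)
with open('all_seasons.csv', mode='r') as file:
    csvFile = csv.reader(file)
    # ignore the header row
    next(csvFile)
    for lines in csvFile:
        # column 21 holds the season; its first four characters are the year
        year = int(lines[21][:4])
        if 2018 < year < 2023 and len(pool) < 100 and lines[1] not in pool_names:
            pool.append(lines)
            pool_names.add(lines[1])
        elif len(training_data) < 5000:
            training_data.append(lines)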


## Optimal Team: Considerations
For my optimal team, I will aim for:
- 2 or more players in the top 20% for shooting percentage.
- At least 1 player in the top 10% for rebounds.
- At least 1 player with a defensive rebound percentage greater than 0.2.
- 3 or more players in the top 20% for net rating.
- 2 or more players with better-than-average assists.

## Model

In [13]:
def extract_features(pool):
    features_list = []
    for player in pool:
        # extract relevant features based on considerations above
        features = {
            'name': player[1],
            'ts_pct': player[19],
            'reb': player[13],
            'dreb_pct': player[17],
            'rating': player[15],
            'ast': player[14]
        }
        features_list.append(features)
    return features_list


## Simulate Training Data to train model

In [14]:


def simulate_data(sample, num_iter=10000):
    X = []
    y = []
    # calculate average assists
    assists = np.array([float(player[14]) for player in sample])
    average_assists = np.mean(assists)
    # calculate top 20% of shooting percentage
    top_20_ts = np.percentile([float(player[19]) for player in sample], 80)
    # calculate top 10% of rebound
    top_10_reb = np.percentile([float(player[13]) for player in sample], 90)
    # calculate top 20% net rating
    top_20_rating = np.percentile([float(player[15]) for player in sample], 80)

    for i in range(num_iter):
        # select 5 random players from list
        selected_players = random.sample(sample, 5)
        features = extract_features(selected_players)
        X.append(features)
        """
        - 2 or more players in top 20% of shooting percentage.
        - Player in the top 10% for rebounds.
        - Player with a Defensive Rebound percentage bigger than 0.2
        - 3 or more players in top 20% of best net rating
        - 2 or more players with better than average assists
        """
        label = 0
        # check if 2 or more players have better than average assists
        players_ast = [player for player in features if float(player['ast']) > average_assists]
        if len(players_ast) >= 2:
            # check if 2 or more players in top 20% of shooting percentage
            players_ts = [player for player in features if float(player['ts_pct']) > top_20_ts]
            if len(players_ts) >= 2:
                # check if any player is in the top 10% for rebounds
                players_reb = [player for player in features if float(player['reb']) > top_10_reb]
                if len(players_reb) >= 1:
                    # check if any player on team has dreb pct > 0.2
                    players_dreb = [player for player in features if float(player['dreb_pct']) > 0.2]
                    if len(players_dreb) >= 1:
                        # check if 3 or more players in top 20% of net rating
                        players_rating = [player for player in features if float(player['rating']) > top_20_rating]
                        if len(players_rating) >= 3:
                            # Optimal Team Found
                            label = 1
        y.append(label)
    X = np.array(X)
    y = np.array(y)
    
    # output shapes
    print(f"Shape of X: {X.shape}")
    print(f"Shape of y: {y.shape}")
    
    # Output the number of y = 1 occurrences
    count_ones = np.sum(y == 1)
    print(f"Number of 1's in y: {count_ones}")

    return X, y




## Train Model
Now that we have the training data, we can train the model.

In [15]:
X_train, y_train = simulate_data(training_data)

# Extract numerical features from dictionaries
X_train = np.array([
    [
        float(player['ts_pct']),
        float(player['reb']),
        float(player['dreb_pct']),
        float(player['rating']),
        float(player['ast'])
    ]
    for team in X_train
    for player in team
]).reshape(len(X_train), -1)
                    
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

# Build the model
model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# model.predict returns sigmoid probabilities between 0 and 1;
# threshold them at 0.5 to get binary class labels
y_pred = (y_pred > 0.5).astype(int).ravel()

# Helpful metrics to understanding the predictions made by the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Shape of X: (10000, 5)
Shape of y: (10000,)
Number of 1's in y: 16
Epoch 1/10


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 1ms/step - accuracy: 0.8662 - loss: 0.3167
Epoch 2/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9994 - loss: 0.0072
Epoch 3/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9983 - loss: 0.0162
Epoch 4/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9976 - loss: 0.0204
Epoch 5/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9994 - loss: 0.0051
Epoch 6/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9989 - loss: 0.0079
Epoch 7/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9992 - loss: 0.0054
Epoch 8/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9991 - loss: 0.0053
Epoch 9/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━



TO DO:

## Explain your architecture and how the basketball player characteristics are used as inputs:

1. Data Preparation:

- Pool and Training Data:
    - Separates data: 
        - `pool` holds 100 unique players (2019-2022) for potential future use.
        - `training_data` stores up to 5000 of the rows that were not added to the pool (from any season) for model training.

2. Feature Extraction:

- The `extract_features` function takes a list of players and creates a dictionary of features for each one. 
- These features include:
    - Name (not used as input)
    - True Shooting Percentage (ts_pct)
    - Rebounds (reb)
    - Defensive Rebound Percentage (dreb_pct)
    - Net Rating (rating)
    - Assists (ast)
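
As a small illustration of this mapping, the sketch below applies the same column indices to a single made-up row. The row contents and the `extract_one` and `to_numeric` helpers are hypothetical; only the indices (1, 13, 14, 15, 17, 19) and the feature order come from the notebook code.

```python
# Column indices used by extract_features in this notebook.
COLUMNS = {'name': 1, 'reb': 13, 'ast': 14, 'rating': 15, 'dreb_pct': 17, 'ts_pct': 19}

def extract_one(row):
    """Pick the fields of interest out of one CSV row (a list of strings)."""
    return {key: row[idx] for key, idx in COLUMNS.items()}

def to_numeric(features):
    """Hypothetical helper: drop the name and convert the rest to floats,
    in the order the model input is built (ts_pct, reb, dreb_pct, rating, ast)."""
    order = ['ts_pct', 'reb', 'dreb_pct', 'rating', 'ast']
    return [float(features[key]) for key in order]

# Made-up row: 22 string columns, as csv.reader would yield them.
row = [''] * 22
row[1] = 'Example Player'
row[13], row[14], row[15] = '7.5', '3.2', '4.1'
row[17], row[19] = '0.21', '0.58'

features = extract_one(row)
vector = to_numeric(features)
print(features['name'], vector)
```

The name stays in the dictionary for reference but is left out of the numeric vector, which is why it is listed above as "not used as input".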

3. Simulating Data for Training:

- The `simulate_data` function generates training data with labels indicating successful teams based on pre-defined criteria.
- It takes a list of players (`sample`) and a number of simulations (`num_iter`).
- Here's how player characteristics are used as inputs:
    - Averages and percentiles are calculated for assists, shooting percentage, rebounds, and net rating from the `sample` data.
    - Loops through `num_iter` simulations:
        - Selects 5 random players.
        - Extracts their features using `extract_features`.
        - Assigns a label (0 or 1) based on these criteria:
            - At least 2 players with above-average assists (uses extracted assist data).
            - At least 2 players with top 20% shooting percentage (uses shooting percentage data).
            - At least 1 player with top 10% rebounds (uses rebound data).
            - At least 1 player with defensive rebound percentage > 0.2 (uses defensive rebound percentage data).
            - At least 3 players with top 20% net rating (uses net rating data).
    - This process essentially creates training examples where the input is a "team" of 5 players (represented by their features) and the label is 1 if it meets the criteria for a successful team.
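
The labeling rule described above can be sketched on a toy sample. This is a minimal illustration, not the notebook's code: it uses the standard library's `statistics.quantiles` with `method='inclusive'` (which interpolates linearly and should agree with the default behavior of `np.percentile`) instead of NumPy, and the 20 players are made up so that player `i` is better than player `i-1` in every statistic.

```python
from statistics import mean, quantiles

def compute_thresholds(sample):
    """Cut-offs derived from the whole sample of players."""
    return {
        'avg_ast': mean(p['ast'] for p in sample),
        # 80th percentile = lower bound of the top 20%
        'ts_80': quantiles([p['ts_pct'] for p in sample], n=5, method='inclusive')[3],
        # 90th percentile = lower bound of the top 10%
        'reb_90': quantiles([p['reb'] for p in sample], n=10, method='inclusive')[8],
        'rating_80': quantiles([p['rating'] for p in sample], n=5, method='inclusive')[3],
    }

def label_team(team, t):
    """Return 1 if the five-player team meets every criterion, else 0."""
    def count(predicate):
        return sum(1 for p in team if predicate(p))
    meets_all = (
        count(lambda p: p['ast'] > t['avg_ast']) >= 2
        and count(lambda p: p['ts_pct'] > t['ts_80']) >= 2
        and count(lambda p: p['reb'] > t['reb_90']) >= 1
        and count(lambda p: p['dreb_pct'] > 0.2) >= 1
        and count(lambda p: p['rating'] > t['rating_80']) >= 3
    )
    return 1 if meets_all else 0

# Made-up sample of 20 players with steadily increasing statistics.
sample = [
    {'ast': float(i), 'ts_pct': 0.40 + 0.01 * i, 'reb': float(i),
     'dreb_pct': 0.10 + 0.01 * i, 'rating': float(i)}
    for i in range(20)
]
t = compute_thresholds(sample)
strong_team = [sample[i] for i in (16, 17, 18, 19, 0)]
weak_team = [sample[i] for i in (0, 1, 2, 3, 4)]
mixed_team = [sample[i] for i in (16, 17, 0, 1, 2)]
print(label_team(strong_team, t), label_team(weak_team, t), label_team(mixed_team, t))
```

Because every criterion must hold at once, positive labels are rare when teams are drawn at random, which matches the small number of 1's reported by `simulate_data`.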

4. Model Building and Training:

- Feature Extraction for Training Data:
    - Converts each simulated team (a list of five player feature dictionaries) into one row of a NumPy array suitable for the model.
    - Each row of this array (X_train) holds 25 values: the five numerical features (shooting percentage, rebounds, defensive rebound percentage, net rating, assists) for each of the five players on the team.
- Splitting Data for Training and Testing:
    - The code uses `train_test_split` with `test_size=0.5` to separate the data into training and testing sets (X_train, X_test, y_train, y_test). This allows for evaluating model performance on unseen data.
- Model Architecture:
    - A `Sequential` neural network is built with Keras.
    - It has two hidden `Dense` layers, with 64 and 32 units, that use ReLU activation for non-linearity.
    - The output layer is a single unit with sigmoid activation for binary classification (successful team or not).
- Model Training:
    - The model is trained on the prepared features (X_train) and labels (y_train) for 10 epochs with a batch size of 32.
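
To make the input shape and model size concrete, here is a minimal sketch in plain Python. The `flatten_team` and `dense_params` helpers and the example team are hypothetical; the feature order and the 64-32-1 layer sizes are taken from the notebook code.

```python
FEATURE_ORDER = ['ts_pct', 'reb', 'dreb_pct', 'rating', 'ast']

def flatten_team(team):
    """Turn a team (list of 5 feature dicts) into one flat list of 25 floats."""
    return [float(player[key]) for player in team for key in FEATURE_ORDER]

def dense_params(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_sizes:
        total += n_inputs * units + units  # weight matrix plus bias vector
        n_inputs = units
    return total

# Made-up team: five players with identical string-valued statistics.
team = [
    {'name': f'P{i}', 'ts_pct': '0.55', 'reb': '5.0',
     'dreb_pct': '0.15', 'rating': '1.0', 'ast': '2.0'}
    for i in range(5)
]
row = flatten_team(team)
n_params = dense_params(len(row), [64, 32, 1])
print(len(row), n_params)
```

With 25 inputs, the three layers contribute 1664, 2080, and 33 trainable parameters respectively, 3777 in total.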

5. Evaluation:

- The model makes predictions on the testing set (`X_test`) using `model.predict(X_test)`.
- The predicted labels (`y_pred`) are converted to binary (0 or 1) using a threshold of 0.5 (you can adjust this value). 
- Finally, various evaluation metrics like accuracy, precision, recall, and F1-score are calculated to assess model performance on the unseen testing data. 
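
The thresholding and metric calculations can be sketched without any libraries. The probabilities and labels below are made up, and `to_labels` and `binary_metrics` are hypothetical helpers that mirror what the threshold step and the scikit-learn metric functions compute for the binary case.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs into 0/1 predictions."""
    return [1 if p > threshold else 0 for p in probabilities]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': (tp + tn) / len(pairs), 'precision': precision,
            'recall': recall, 'f1': f1}

y_true = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.70, 0.80, 0.40, 0.20, 0.90]

y_pred = to_labels(probs)
metrics = binary_metrics(y_true, y_pred)
print(y_pred, metrics)

# A threshold of 1 or more can never be exceeded by a sigmoid output,
# so every prediction becomes 0 and precision has no positives to measure.
all_zero = to_labels(probs, threshold=1.0)
print(all_zero, binary_metrics(y_true, all_zero))
```

Lowering the threshold trades precision for recall, which matters when optimal teams are rare.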


## Interpret the output of your MLP in the context of selecting an optimal basketball team:

This MLP is trained to predict whether a five-player basketball lineup is "optimal", given the input features: shooting percentage, rebounds, defensive rebound percentage, net rating, and assists. These inputs are key indicators of player performance. The model turns them into a binary outcome: 1 stands for an optimal team and 0 for a suboptimal team. Because the training lineups are simulated from historical player statistics, the model learns which combinations of players are strong or weak and can give insight into which lineups are likely to perform well.

These predictions could then inform the choice of an optimal basketball team by showing which sets of players are statistically balanced across the key performance metrics. Teams predicted as "optimal" have an effective mix of players, while those labeled "suboptimal" need adjustment. Coaches or team selectors could try different combinations of players to fine-tune their strategy and make sure the team is strong in many aspects of the game, such as scoring, defense, and teamwork.

Accuracy, precision, recall, and F1 score together indicate how reliable the model is. High accuracy suggests the model can discriminate between good and bad lineups, but because optimal teams are rare in the simulated data it should be read alongside the other metrics. High precision means few lineups are wrongly flagged as optimal, and high recall means few optimal lineups are missed. In the end, the MLP provides a data-driven method of team selection that aims to choose a lineup that will perform better on the court.
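
To put numbers on how rare optimal teams are, the sketch below computes the scores of a trivial classifier that always predicts 0. The counts come from the simulation output shown earlier (16 positives among 10,000 simulated teams); applying them to the full simulated set rather than the 50% test split is an assumption, since the number of positives in the test split is not shown.

```python
n_total = 10_000   # simulated teams in the run shown above
n_positive = 16    # teams labeled optimal in that run

# A classifier that always predicts 0 gets every negative right
# and every positive wrong.
true_negatives = n_total - n_positive
baseline_accuracy = true_negatives / n_total
baseline_recall = 0 / n_positive

print(f"Always-0 accuracy: {baseline_accuracy:.4f}")
print(f"Always-0 recall:   {baseline_recall:.4f}")
```

An accuracy near 0.998 is therefore reachable without identifying a single optimal team, which is why recall and precision on the positive class are the more informative numbers here.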