The project aims to develop a neural network that predicts the Brownlow Votes awarded in an AFL game.

The content consists of three parts: data, models, and training/evaluation.

1. Data preparation

2. Neural networks as predictive models

3. Training and evaluation procedures


# Data preparation
This section has three parts:

1.1 Load the raw data from the RawData.csv file using Python's built-in csv module

1.2 Remove outliers and noise

1.3 Create Dataset and Dataloader objects in PyTorch framework


## Load data

In [None]:
import csv

#load data and get basic information
with open('RawData.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    raw_data = []
    for row in csv_reader:
        raw_data.append(row)

attribute_names = raw_data[0]
attribute_names_to_idx = {n: i for i, n in enumerate(attribute_names)}
raw_data = raw_data[1:]

# collect the distinct team names (looked up by column name, not by a fixed index)
teams = []
for row in raw_data:
    team = row[attribute_names_to_idx['Team']]
    if team not in teams:
        teams.append(team)
teams.sort()
teams_to_idx = {t: i for i, t in enumerate(teams)}

The raw data contains $146607$ rows in total. Below, the rows are grouped into a dictionary keyed by game for easy manipulation.

In [8]:
def group_data_by_games(data):
    games = {}
    game_count = 0
    prev_game = None
    for row in data:
        game = {'Date': row[attribute_names_to_idx['Date']],
                'Season': row[attribute_names_to_idx['Season']],
                'Round': row[attribute_names_to_idx['Round']],
                'Home Team': row[attribute_names_to_idx['Home Team']],
                'Away Team': row[attribute_names_to_idx['Away Team']],
                'Home Score': row[attribute_names_to_idx['Home Score']],
                'Away Score': row[attribute_names_to_idx['Away Score']],
                'Margin': row[attribute_names_to_idx['Margin']]}
        if game != prev_game:
            # the game-level fields changed, so this row starts a new game
            game_count += 1
            games[game_count] = game.copy()
            games[game_count]['Stats'] = [row[10:-2] + [row[-1]]]
            games[game_count]['Brownlow Votes'] = [row[attribute_names_to_idx['Brownlow Votes']]]
            games[game_count]['Players'] = [row[attribute_names_to_idx['Name']]]
            games[game_count]['Teams'] = [row[attribute_names_to_idx['Team']]]
        else:
            games[game_count]['Stats'].append(row[10:-2] + [row[-1]])
            games[game_count]['Brownlow Votes'].append(row[attribute_names_to_idx['Brownlow Votes']])
            games[game_count]['Players'].append(row[attribute_names_to_idx['Name']])
            games[game_count]['Teams'].append(row[attribute_names_to_idx['Team']])
        prev_game = game
    return games


data_by_games = group_data_by_games(raw_data)

## Data Processing
Data processing includes noise removal, feature selection, and normalization.

In [20]:
import numpy as np

def clean_data(data):
    """
    Remove noisy games: finals-series matches and games that do not have
    exactly 44 player rows.
    :param data: a dict, output of group_data_by_games()
    :return: a new dict containing only the retained games
    """
    new_data = data.copy()
    for i, d in list(new_data.items()):
        rd = d['Round']
        if len(d['Players']) != 44 or rd in ['GF', 'EF', 'QF', 'PF', 'SF']:
            new_data.pop(i)
    return new_data

def transform_strings_to_numbers(data):
    new_data = data.copy()
    for i, d in new_data.items():
        new_data[i]['Stats'] = np.array([list(map(float, dd)) for dd in d['Stats']])
        # np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
        new_data[i]['Brownlow Votes'] = np.array(list(map(float, d['Brownlow Votes'])), dtype=int)
    return new_data

cleaned_data = clean_data(data_by_games)
transformed_data = transform_strings_to_numbers(cleaned_data)

In total, there are $3333$ games, of which $3177$ remain after cleaning. Of the removed games, $3$ have fewer than 44 players and the rest belong to the finals series.

In [22]:
def split_data_for_training_test(data: dict):
    train_data = {}
    test_data = {}
    val_data = {}
    train_count = 0
    test_count = 0
    for i, d in data.items():
        if float(d['Season']) <= 2015:
            train_data[train_count] = d
            train_count += 1
        else:
            test_data[test_count] = d
            test_count += 1

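    val_count = 0
    # hold out a random 20% of the training games for validation and remove
    # them from the training set so that the two sets do not overlap
    for i in np.random.permutation(train_count)[:train_count // 5]:
        val_data[val_count] = train_data.pop(int(i))
        val_count += 1
    return train_data, val_data, test_data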


train_data, val_data, test_data = split_data_for_training_test(transformed_data)


Next, create Dataset and DataLoader objects for training, validation, and testing, following the PyTorch tutorials.

### Normalization
Brownlow Votes are awarded game by game, i.e. only the players in the same game compete against each other, so the data should also be normalized within each game.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

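def normalize_data(stats):
    # z-score each stat column within a single game
    means = np.mean(stats, 0)
    std = np.std(stats, 0)
    std[std == 0] = 1.0  # a column that is constant within a game would otherwise divide by zero
    return (stats-means)/std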
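
As a quick illustration of per-game normalization, the sketch below standardizes a toy game with four hypothetical players and two made-up stats. The array values and the name `zscore_per_game` are assumptions for illustration only; they are not taken from the dataset.

```python
import numpy as np

def zscore_per_game(stats):
    """Standardize each stat column using the mean and std of this game only."""
    means = stats.mean(axis=0)
    std = stats.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (stats - means) / std

# toy game: 4 players x 2 stats (made-up numbers)
toy_game = np.array([[30.0, 2.0],
                     [20.0, 0.0],
                     [10.0, 1.0],
                     [20.0, 1.0]])
normalized = zscore_per_game(toy_game)
print(normalized.mean(axis=0))  # approximately 0 for every column
print(normalized.std(axis=0))   # approximately 1 for every column
```

After this step a value says how far a player is from the average of that particular game, which is what matters when votes are decided within a game.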

### Augmentation
A simple augmentation method is to shuffle the order of the players within a game (stats and votes together). A good predictive model should give the correct answer regardless of the ordering.

Other augmentation methods, such as adding random noise or random masks to the stats or to the targets, are left to future work due to time limitations.

In [None]:
def data_augmentation(stats, votes):
    # shuffle the order of players
    idxes = range(stats.shape[0])
    idxes = np.random.permutation(idxes)
    return stats[idxes], votes[idxes]
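
The noise and mask augmentations mentioned above could look roughly like the following sketch. The function names `add_gaussian_noise` and `random_mask` and the default values of `sigma` and `p` are assumptions chosen for illustration; the notebook does not implement or evaluate them.

```python
import numpy as np

def add_gaussian_noise(stats, sigma=0.05, rng=None):
    """Return a copy of stats with zero-mean Gaussian noise added to every entry."""
    rng = np.random.default_rng() if rng is None else rng
    return stats + rng.normal(0.0, sigma, size=stats.shape)

def random_mask(stats, p=0.1, rng=None):
    """Return a copy of stats in which each entry is zeroed with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(stats.shape) >= p
    return stats * keep

rng = np.random.default_rng(0)
toy_stats = np.ones((44, 21))  # placeholder for the stats of one game
noisy = add_gaussian_noise(toy_stats, rng=rng)
masked = random_mask(toy_stats, rng=rng)
print(noisy.shape, masked.shape)
```

Both helpers leave the votes untouched, so they could be chained with the player shuffle inside the same augmentation step.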

### Implementation

Normalization and augmentation are implemented as functions that are called on each data sample (each game) individually.
Dataset and DataLoader support multi-process data loading,
so applying normalization and augmentation at this stage lets the work run in parallel worker processes.

## Features and Labels

### Features
The players' stats are used as the basic features.

In addition, other information could serve as features, such as the team, the win/loss result, and even the identity of the player.
In general, I expect a player on the winning team to have a higher chance of receiving votes than a player on the losing team.
Also, a well-known player may have an edge when several players perform similarly in a game.
Team information is a rough proxy for player identity, since players are usually associated with a particular team.
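
A minimal sketch of the win/loss feature, assuming the `Margin` column is recorded from the home team's perspective (home score minus away score). The team names and the helper `signed_margins` are made up for illustration.

```python
def signed_margins(home_team, teams, margin):
    """Give each player the margin from the point of view of their own team."""
    return [margin if team == home_team else -margin for team in teams]

# two players from each side of a game the home team won by 12 points
teams = ['Home FC', 'Home FC', 'Away FC', 'Away FC']
print(signed_margins('Home FC', teams, 12.0))  # [12.0, 12.0, -12.0, -12.0]
```

Encoding the result as a signed margin rather than a win/loss flag also tells the model how decisive the result was.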

### Labels
In this task, the model learns to predict which player receives the most votes (3 votes) in a game. Hence, it can be formulated as a classification problem:
given the stats of 44 players (in random order), classify the game into 1 of 44 classes.

To also use the information about the 2-vote and 1-vote players, the votes of the 44 players can be encoded as a distribution with three peaks:
the highest at the 3-vote player, the second highest at the 2-vote player,
and the lowest at the 1-vote player. In this case, the model output is a distribution, and the model can be trained with a KL divergence loss.
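
To see what such a three-peak target looks like, the sketch below applies a plain softmax to the raw vote counts of one game. The helper name `votes_to_distribution` and the toy vote vector are assumptions for illustration.

```python
import math

def votes_to_distribution(votes):
    """Softmax over the raw vote counts (0, 1, 2 or 3 per player)."""
    exps = [math.exp(v) for v in votes]
    total = sum(exps)
    return [e / total for e in exps]

# one game: players 0, 1 and 2 receive 3, 2 and 1 votes; the other 41 receive none
votes = [3, 2, 1] + [0] * 41
dist = votes_to_distribution(votes)
print([round(p, 3) for p in dist[:4]])
print(round(sum(dist[3:]), 3))  # probability mass left on the 41 vote-less players
```

With a plain softmax, more than half of the probability mass still sits on the 41 players without votes, so the three peaks are fairly shallow.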


In [None]:
def add_team_info(stats, teams):
    tm = np.array([teams_to_idx[t] for t in teams])
    return np.column_stack([stats, tm])


def add_win_info(stats, home_team, teams, margin):
    wins = np.array([float(margin) if home_team == t else -float(margin) for t in teams])
    return np.column_stack([stats, wins])


def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()


class Data(Dataset):
    def __init__(self, data, augmentation=False, normalization=False, with_team_info=False, with_win_info=False):
        super(Data, self).__init__()
        self.data = data
        self.ids_to_keys = {i: k for i, k in enumerate(data.keys())}
        self.augmentation = augmentation
        self.normalization = normalization
        self.with_team_info = with_team_info
        self.with_win_info = with_win_info

    def __getitem__(self, index):
        data = self.data[self.ids_to_keys[index]]
        stats = data['Stats']
        votes = data['Brownlow Votes']
        if self.normalization:
            stats = normalize_data(stats)
        if self.with_team_info:
            stats = add_team_info(stats, data['Teams'])
        if self.with_win_info:
            stats = add_win_info(stats, data['Home Team'], data['Teams'], data['Margin'])
        if self.augmentation:
            stats, votes = data_augmentation(stats, votes)
        votes_idx = np.argmax(votes)
        votes_dis = softmax(votes)

        return stats, votes_idx, votes_dis

    def __len__(self):
        return len(self.data)

augmentation = True #hyper-param
normalization = True #hyper-param
with_team_info = True #hyper-param
with_win_info = True #hyper-param
batch_size = 128 #hyper-param
train_loader = DataLoader(Data(train_data, augmentation=augmentation, normalization=normalization, with_team_info=with_team_info, with_win_info=with_win_info),
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=8,
                          pin_memory=True)
val_loader = DataLoader(Data(val_data, augmentation=augmentation, normalization=normalization, with_team_info=with_team_info, with_win_info=with_win_info),
                        batch_size=batch_size,
                        shuffle=True,
                        num_workers=8,
                        pin_memory=True)
test_loader = DataLoader(Data(test_data, augmentation=False, normalization=normalization, with_team_info=with_team_info, with_win_info=with_win_info),
                         batch_size=batch_size,
                         shuffle=False,
                         num_workers=8,
                         pin_memory=True)

# Model creation
In this project, three types of deep neural networks are evaluated, as described below.

The first model is a multi-layer perceptron (MLP). All the players' stats are concatenated into a 1-D vector, and the MLP models the relationships within that vector.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, input_dim=44*21, out_dim=44, p=0.4):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(nn.Linear(input_dim, 1024),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(1024),
                                    nn.Linear(1024, 1024),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(1024),
                                    nn.Linear(1024,1024),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(1024),
                                    nn.Dropout(p),
                                    nn.Linear(1024, out_dim))

    def forward(self, x):
        x = x.view(x.size(0), -1)
        out = self.layers(x)
        return out

The second kind of network is a convolutional neural network (CNN).

When CNNs are applied to images, a convolutional kernel (e.g. 3x3) captures the relationships between neighbouring pixels. Each pixel has several channels (e.g. 3 channels in an RGB image, 1024 channels in a feature map).

Similarly, in this model each player is treated as a pixel and the player's stats are treated as channels. The convolutional kernel captures the relationships between neighbouring players in the data, and its size determines how far the neighbourhood extends.

In [None]:
class CONV(nn.Module):
    def __init__(self, in_channels=21):
        super(CONV, self).__init__()
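        # padding=1 keeps the length at 44 through the stride-1 layers;
        # the two stride-2 layers then reduce it to 22 and 11
        self.layers = nn.Sequential(nn.Conv1d(in_channels, 64, 3, 1, padding=1),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(64),
                                    nn.Conv1d(64, 64, 3, 1, padding=1),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(64),
                                    nn.Conv1d(64, 128, 3, 2, padding=1),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(128),
                                    nn.Conv1d(128, 256, 3, 2, padding=1),
                                    nn.ReLU(),
                                    nn.BatchNorm1d(256))
        self.fc = nn.Linear(256*11, 44)

    def forward(self, x):
        # the loader yields (batch, players, stats); Conv1d expects (batch, channels, length)
        x = self.layers(x.transpose(1, 2))
        out = self.fc(x.view(x.size(0), -1))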
        return out
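
The input size of the final linear layer, `256*11`, follows from how each convolution changes the sequence length of 44 players. The sketch below applies the standard output-length formula for a 1-D convolution (dilation 1); the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride, padding=0):
    """Output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def stack_out_len(length, padding):
    # kernel size 3 throughout, strides 1, 1, 2, 2 as in the CONV model
    for stride in (1, 1, 2, 2):
        length = conv1d_out_len(length, 3, stride, padding)
    return length

print(stack_out_len(44, padding=1))  # 11
print(stack_out_len(44, padding=0))  # 9
```

A flattened size of `256*11` therefore corresponds to a padding of 1 on every layer; without padding the flattened size would be `256*9`.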

The third kind of model is the Transformer, proposed in the paper [Attention is all you need](https://arxiv.org/abs/1706.03762 "Attention is all you need").
In this model, each player's stats are treated as a vector, and the network reasons about the relationships between the players' stats. In theory, it is the most suitable model for this task.

In [None]:
class ATT(nn.Module):
    def __init__(self, input_dim, inner_dim, output_dim):
        super(ATT, self).__init__()
        self.K = nn.Conv1d(input_dim, inner_dim, 1, 1)
        self.Q = nn.Conv1d(input_dim, inner_dim, 1, 1)
        self.V = nn.Conv1d(input_dim, inner_dim, 1, 1)
        self.out = nn.Conv1d(inner_dim, output_dim, 1, 1)
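        # skip path: identity when the channel count is unchanged, otherwise a 1x1 convolution
        self.identity = nn.Identity() if output_dim == input_dim else nn.Conv1d(input_dim, output_dim, 1, 1)

    def forward(self, x):
        # x: (batch, channels, players)
        q = self.Q(x)
        k = self.K(x)
        v = self.V(x)
        # player-to-player scores, shape (batch, players, players)
        s = torch.matmul(k.transpose(1, 2), q)
        s = F.softmax(s, 1)
        output = self.identity(x) + self.out(torch.matmul(v, s))
        return output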


class TRANSFORMER(nn.Module):
    def __init__(self, input_dim=21):
        super(TRANSFORMER, self).__init__()
        self.layers = nn.Sequential(ATT(input_dim, 32, 64),
                                    ATT(64, 64, 128),
                                    ATT(128, 64, 128),
                                    ATT(128, 64, 128))
        self.fc = nn.Linear(44*128, 44)

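    def forward(self, x):
        # the loader yields (batch, players, stats); the 1x1 convolutions expect (batch, stats, players)
        x = self.layers(x.transpose(1, 2))
        return self.fc(x.reshape(x.size(0), -1))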
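
To make the idea of relating players to each other concrete, here is a minimal NumPy sketch of single-head dot-product attention over the players of one game. It is not the notebook's `ATT` module: the weights are random, all names are illustrative assumptions, and the 1/sqrt(d) scaling from the paper is included here even though `ATT` above does not use it.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (players, features). Returns the attended values and the attention weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])                  # (players, players)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
players, features, inner_dim = 44, 21, 32
x = rng.normal(size=(players, features))
w_q, w_k, w_v = (rng.normal(size=(features, inner_dim)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (44, 32) (44, 44)
```

Because every player attends to every other player, permuting the input rows only permutes the output rows of the attention layer, which fits a task where the player order carries no meaning.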

In [None]:
def build_network(network_choice, with_win_info, with_team_info):
    feature_dim = 21
    if with_win_info:
        feature_dim += 1
    if with_team_info:
        feature_dim += 1
    if network_choice == 'mlp':
        return MLP(44*feature_dim)
    elif network_choice == 'conv':
        return CONV(feature_dim)
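    elif network_choice == 'transformer':
        return TRANSFORMER(feature_dim)
    # fail loudly instead of silently returning None for an unknown name
    raise ValueError(f'unknown network choice: {network_choice}')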

## Loss function and Accuracy measurement
The loss consists of two parts: the cross-entropy loss for classification and the KL divergence loss for the vote distribution.
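
Written out, the combined loss implemented below has the following form, up to the averaging constant applied by `nn.KLDivLoss`. The symbols are notation introduced here for illustration: $z$ is the network output (logits), $y$ the index of the 3-vote player, $d$ the target vote distribution, $T$ the temperature and $\alpha$ the mixing weight.

$$
\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\big(\mathrm{softmax}(d/T) \,\|\, \mathrm{softmax}(z/T)\big) + (1-\alpha)\, \mathrm{CE}(z, y)
$$

When no target distribution is supplied, only the cross-entropy term $\mathrm{CE}(z, y)$ is used.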

Top-k accuracy is used as the metric: select the $k$ indexes with the highest output values; if one of them is the target, the prediction counts as correct.
Top-1 is the conventional accuracy; top-3 and top-5 accuracies are also reported here.

In [None]:
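def loss_function(output, target, target_dis=None):
    T = 5 #hyper-parameter (temperature)
    alpha = 0.5 #hyper-parameter (weight of the KL term)
    if target_dis is None:
        # hard labels only: plain cross entropy on the 3-vote player
        return nn.CrossEntropyLoss()(output, target)
    else:
        # soft labels: temperature-scaled KL term plus cross entropy.
        # target_dis is already a softmax over the votes (see Data.__getitem__),
        # so it is softened a second time here.
        return nn.KLDivLoss()(F.log_softmax(output/T, dim=1),
                             F.softmax(target_dis/T, dim=1)) * (alpha * T * T) + \
              F.cross_entropy(output, target) * (1. - alpha)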


def metric_function(output, target, topk=1):
    topks, topk_idxes = torch.topk(output, topk, dim=1)
    count = 0
    for p, t in zip(topk_idxes, target):
        if t in p:
            count += 1
    acc = count * 1.0 / len(target)
    return acc
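
As a worked example of the top-k metric, the following plain-Python sketch scores two toy games with four players each. The scores, targets and the helper name `topk_accuracy` are made up for illustration.

```python
def topk_accuracy(scores, targets, k=1):
    """scores: one list of per-player scores per game; targets: true index per game."""
    hits = 0
    for sample_scores, target in zip(scores, targets):
        # indexes of the players sorted from highest to lowest score
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

scores = [[0.1, 0.6, 0.2, 0.1],
          [0.3, 0.1, 0.4, 0.2]]
targets = [1, 0]
print(topk_accuracy(scores, targets, k=1))  # 0.5
print(topk_accuracy(scores, targets, k=2))  # 1.0
```

In the second game the true player is ranked second, so the game counts as a miss for top-1 but as a hit for top-2.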

# Training and evaluation framework
The network is trained on the training data and evaluated on the validation data for $30$ epochs.

In each epoch, the following two functions are run in turn.

In [None]:
def train(epoch, model, optimizer, dataloader, disstill=False):
    model.train()

    for idx, (data, target, target_dis) in enumerate(dataloader):
        optimizer.zero_grad()
        data, target, target_dis = data.float().to(device), target.to(device), target_dis.float().to(device)
        output = model(data)
        if disstill:
            loss = loss_function(output, target, target_dis)
        else:
            loss = loss_function(output, target)

        loss.backward()
        optimizer.step()
        if idx % 20 == 0:
            print(f'Epoch {epoch} batch {idx}: loss {loss.item()}')
    return


def evaluate(epoch, model, dataloader, is_test=False):
    model.eval()
    top1 = 0
    top3 = 0
    top5 = 0
    # gradients are not needed for evaluation
    with torch.no_grad():
        for idx, (data, target, _) in enumerate(dataloader):
            data, target = data.float().to(device), target.to(device)
            output = model(data)
            # metric_function returns a per-batch rate, so weight it by the batch size
            top1 += metric_function(output, target, topk=1) * len(target)
            top3 += metric_function(output, target, topk=3) * len(target)
            top5 += metric_function(output, target, topk=5) * len(target)

    num_samples = len(dataloader.dataset)
    split = 'Test' if is_test else 'Eval'
    print(f'{split} epoch {epoch} top1 is {top1/num_samples}, top3: {top3/num_samples}, top5: {top5/num_samples}')
    return top1 / num_samples

The complete training and evaluation loop:

In [None]:
import torch.optim as optim
network_type = 'transformer'  # hyper-parameter
lr = 0.01 #hyper-param
momentum = 0.9 #hyper-param
wd = 1e-4 #hyper-param
lr_steps = [10, 20, 25] #hyper-param
epochs = 30 #hyper-param
network = build_network(network_type, with_win_info, with_team_info)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
network = network.to(device)
optimizer = optim.SGD(network.parameters(), lr=lr, momentum=momentum, weight_decay=wd, nesterov=True)
lr_scheduler = optim.lr_scheduler.MultiStepLR(optimizer, lr_steps, gamma=0.1)

best_acc = 0
for epoch in range(epochs):
    train(epoch, network, optimizer, train_loader)
    # step the scheduler after the epoch's optimizer updates, as PyTorch >= 1.1 expects
    lr_scheduler.step()
    mean_acc = evaluate(epoch, network, val_loader)
    if (epoch + 1) % 5 == 0:
        torch.save(network.state_dict(), f'epoch{epoch}.pth')
    if best_acc < mean_acc:
        best_acc = mean_acc
        torch.save(network.state_dict(), 'best.pth')

network.load_state_dict(torch.load('best.pth', map_location=device))
evaluate(-1, network, test_loader, True)
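
The step schedule configured above (`lr = 0.01`, `lr_steps = [10, 20, 25]`, `gamma = 0.1`) can be tabulated with a few lines of plain Python. This follows the documented behaviour of `MultiStepLR`, but it is only an illustrative re-implementation with a hypothetical helper name.

```python
def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate in a given epoch: base_lr times gamma for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

for epoch in (0, 9, 10, 19, 20, 24, 25, 29):
    print(epoch, multistep_lr(0.01, [10, 20, 25], 0.1, epoch))
```

The learning rate thus drops by a factor of ten at epochs 10, 20 and 25, so the last five epochs run at one thousandth of the initial rate.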

# Results and Discussions
We can explore different models (MLP/ConvNet/attention model) and hyper-parameters (optimizer, learning rate, number of layers, network depth and width, etc.) to find the model with the highest top-1 (or top-3) accuracy.

The best results obtained so far (better settings surely exist) are, as top-1/top-3/top-5 accuracies in %, __53.5/81.4/90.5__.

Some findings:

1. The Transformer generally achieves the best performance, with top-1/top-3/top-5 accuracies of 53.5/81.4/90.5. The Conv model performs similarly, while the MLP performs worst, with a drop of more than 10 percentage points in top-1 accuracy.
2. The training batch size does not affect the results much as long as it is smaller than 256. In the experiments, batch sizes of 32/64/128 achieve similar performance.
3. Data normalization improves performance slightly, by 4 percentage points.
4. Data augmentation improves performance considerably, by more than 10 percentage points.
5. The win/loss information increases accuracy by more than 5 percentage points, while the team information has little impact on the results.
6. The information about the 2-vote and 1-vote players does not help the prediction. On the contrary, it reduces accuracy: the smaller $T$ (temperature), the larger the drop.

## Future work
1. Explore architectures of Transformer network
2. Explore player information in prediction
3. Explore more hyper-parameters, such as more training epochs
4. Explore more data augmentation methods