comments: 
    - count moves without frequency is missing
    - sequence code does not work yet
    - link notebook to feature importance is missing
    - finish choose a model, testing and results

# Data Mining - StarCraft Player Prediction

The goal of the project was it to predict which player is playing a certain game based on the moves he or she made.

In [1]:
# imports
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
import joblib
import warnings
# Suppress all warnings
warnings.simplefilter("ignore")

## Pre-processing Training Data

To train a model we were given data where each row represented a gameplay. A row included information about the race, the player who was playing and the moves he or she made during the game. 

In [2]:
# Load the training dataset
train_data = pd.read_csv('train_data.csv', delimiter=';')

# Drop unnecessary columns
train_data = train_data.drop(['PlayerURL', 'PlayerName'], axis=1)

train_data.head()

Unnamed: 0,PlayerID,Race,Move 1,Move 2,Move 3,Move 4,Move 5,Move 6,Move 7,Move 8,...,Move 2554,Move 2555,Move 2556,Move 2557,Move 2558,Move 2559,Move 2560,Move 2561,Move 2562,Move 2563
0,1021189,Terran,s,hotkey11,hotkey21,hotkey31,hotkey41,hotkey51,hotkey61,s,...,,,,,,,,,,
1,1021189,Terran,s,s,hotkey11,hotkey21,hotkey31,hotkey41,hotkey51,hotkey61,...,,,,,,,,,,
2,1021189,Terran,s,hotkey11,hotkey21,hotkey31,hotkey41,hotkey51,hotkey61,hotkey71,...,,,,,,,,,,
3,1021189,Terran,s,hotkey11,hotkey21,hotkey31,hotkey41,hotkey51,hotkey61,t5,...,,,,,,,,,,
4,1021189,Terran,s,hotkey11,hotkey21,hotkey31,hotkey41,hotkey51,hotkey71,hotkey61,...,,,,,,,,,,


Before starting with the classification algorithms we had to find a way to process our data and make it comparable. We wanted to reduce dimensions without losing information. In order to do so, we started counting how often a player would use a certain move. We did this over the course of a whole game and also over the course of certain time periods.

In [3]:
# TODO

# Create new table that only contains the first column (PlayerId) of train_data
# Keep only the first column but all rows
train_data_new = train_data.iloc[:, :1]

# Specify the target time intervals
#time_intervals = [20, 60, 100, 200]
time_intervals = [5, 20, 60, 100, 200, 270, 340, 550]

calc_column = len(time_intervals)* 14 + 14

# New lists of counts
counts = [[0] * 3052 for _ in range(calc_column)]
# New lists of races
races = [[0] * 3052 for _ in range(3)]

# Go through the rows using the functions to count the actions, map the races
for row_index, row in train_data.iterrows():
    count_moves(row, counts, row_index)
    mapRaces(races, row_index)

    for ti_index, time_interval in enumerate(time_intervals):
        count_move_per_time(row, counts, row_index, time_interval, ti_index+1)

# TODO

NameError: name 'count_moves' is not defined

We then realised this was not very meaningful. How often a player presses a certain key, does not make him or her recognizable. So we decided to focus on frequency of moves. Meaning how frequent a player was using certain moves. Overall and again also over the course of certain time periods. To standardize the data we also created three new columns that mapped the name of the race (which was of type string before) to a zero or a one. Depending on whether this particular race was being played or not. 
Of course, preprocessing also included deleting unnecessary columns like Player URL and Player Name.

In [4]:
def count_moves(row, counts, index):
    total_moves = 0
    for i in range(1, 2564):
        move = row["Move "+ str(i)]
        # count the number of s's
        if move == 's':
            counts[10][index] += 1
        # count the number of Base's
        elif move == 'Base':
            counts[11][index] += 1
        # count the number of SingleMineral's
        elif move == 'SingleMineral':
            counts[12][index] += 1
        # count the hotkeys
        elif isinstance(move, str):
            for j in range(10):
                if move.startswith(f"hotkey{j}"):
                    counts[j][index] += 1

        total_moves += 1  
    # Save the total moves count
    counts[13][index] = total_moves


def count_move_per_time(row, counts, row_index, time_interval, ti_index):
    base_index = ti_index * 14
    total_moves = 0

    for i in range(1, 2564):
        move = row["Move " + str(i)]

        # Count actions for the given time interval
        if move == 's':
            counts[base_index + 10][row_index] += 1
        elif move == 'Base':
            counts[base_index + 11][row_index] += 1
        elif move == 'SingleMineral':
            counts[base_index + 12][row_index] += 1
        elif isinstance(move, str):
            for j in range(10):
                if move.startswith(f"hotkey{j}"):
                    counts[base_index + j][row_index] += 1

        total_moves += 1

        # Continue counting actions after the specified time interval
        if move == f't{time_interval}':
            break

    counts[base_index + 13][row_index] = total_moves


def mapRaces(races, row_index):
    race = train_data['Race'][row_index]

    if race == "Protoss":
        races[0][row_index] = 1
    elif race == "Terran":
        races[1][row_index] = 1
    elif race == "Zerg":
        races[2][row_index] = 1
        
# Create new table that only contains the first column (PlayerId) of train_data
# Keep only the first column but all rows
train_data_new = train_data.iloc[:, :1]


# Specify the target time intervals
#time_intervals = [20, 60, 100, 200]
time_intervals = [5, 20, 60, 100, 200, 270, 340, 550]

calc_column = len(time_intervals)* 14 + 14

# New lists of counts
counts = [[0] * 3052 for _ in range(calc_column)]
# New lists of races
races = [[0] * 3052 for _ in range(3)]


# Go through the rows using the functions to count the actions, map the races
for row_index, row in train_data.iterrows():
    count_moves(row, counts, row_index)
    mapRaces(races, row_index)

    for ti_index, time_interval in enumerate(time_intervals):
        count_move_per_time(row, counts, row_index, time_interval, ti_index+1)
        

for i in range(calc_column):
    locals()[f'count_{i}'] = counts[i]

for i in range(10):
    train_data_new[f'hk{i}Frequency'] = [count / counts[13][index] if counts[13][index] != 0 else 0 for index, count in enumerate(counts[i])]

train_data_new['sFrequency'] = [count / counts[13][index] if counts[13][index] != 0 else 0 for index, count in enumerate(counts[10])]
train_data_new['baseFrequency'] = [count / counts[13][index] if counts[13][index] != 0 else 0 for index, count in enumerate(counts[11])]
train_data_new['singleMineralFrequency'] = [count / counts[13][index] if counts[13][index] != 0 else 0 for index, count in enumerate(counts[12])]

# Adding new columns for the count of moves per interval
for ti_index, time_interval in enumerate(time_intervals):
    base_index = (ti_index + 1) * 14
    for j in range(10):
        column_name = f'hk{j}_t{time_interval}_Frequency'
        train_data_new[column_name] = [count / counts[base_index + 13][index] if counts[base_index + 13][index] != 0 else 0 for index, count in enumerate(counts[base_index + j])]

    train_data_new[f's_t{time_interval}_Frequency'] = [count / counts[base_index + 13][index] if counts[base_index + 13][index] != 0 else 0 for index, count in enumerate(counts[base_index + 10])]
    train_data_new[f'base_t{time_interval}_Frequency'] = [count / counts[base_index + 13][index] if counts[base_index + 13][index] != 0 else 0 for index, count in enumerate(counts[base_index + 11])]
    train_data_new[f'singleMineral_t{time_interval}_Frequency'] = [count / counts[base_index + 13][index] if counts[base_index + 13][index] != 0 else 0 for index, count in enumerate(counts[base_index + 12])]



# Adding new columns for the races
train_data_new['race_Protoss'] = races[0]
train_data_new['race_Terran'] = races[1]
train_data_new['race_Zerg'] = races[2]

# Saving them in a csv file
train_data_new.to_csv('actiontype_count.csv', index=False)

train_data_new.head()

Unnamed: 0,PlayerID,hk0Frequency,hk1Frequency,hk2Frequency,hk3Frequency,hk4Frequency,hk5Frequency,hk6Frequency,hk7Frequency,hk8Frequency,...,hk6_t550_Frequency,hk7_t550_Frequency,hk8_t550_Frequency,hk9_t550_Frequency,s_t550_Frequency,base_t550_Frequency,singleMineral_t550_Frequency,race_Protoss,race_Terran,race_Zerg
0,1021189,0.0,0.158018,0.107296,0.032384,0.030043,0.007803,0.00039,0.0,0.0,...,0.000994,0.0,0.0,0.0,0.407555,0.0,0.000994,0,1,0
1,1021189,0.0,0.149044,0.062037,0.055014,0.017167,0.001951,0.00039,0.0,0.0,...,0.000853,0.0,0.0,0.0,0.320546,0.0,0.0,0,1,0
2,1021189,0.0,0.130316,0.086617,0.018728,0.030823,0.006243,0.00039,0.00039,0.0,...,0.001289,0.001289,0.0,0.0,0.389175,0.0,0.002577,0,1,0
3,1021189,0.0,0.166602,0.120562,0.033554,0.044479,0.021459,0.001171,0.0,0.0,...,0.000971,0.0,0.0,0.0,0.350485,0.0,0.000971,0,1,0
4,1021189,0.0,0.154506,0.077253,0.02263,0.041748,0.010925,0.00039,0.00039,0.0,...,0.001104,0.001104,0.0,0.0,0.42936,0.0,0.002208,0,1,0


By analysing our train dataset, we realised that the same player uses typically the same sequence of moves, for a certain race type, in the first 10 seconds with slight variations. So we wanted to find a unique sequence for each player that immediately identified him by the use of a specific combination of moves, for a specific type of race played. 
We started working on this idea by creating a function for the research of consecutive sequences of moves, excluding the time data,'t5' and 't10' and iterate the process for each player to find the action sequences.
At the end we saved the found sequences to a text file.
We proceeded by defining a function to evaluate the combination of sequences based on their uniqueness, and applied to each player to obtain ranked combinations.

In [5]:
# Get the indices for columns 'Move_1' to 'Move_50'
move_columns = [f'Move_{i}' for i in range(1, 68)]

data_10s = []

# Iterate through each row of the dataframe
for _, row in train_data.iterrows():
    row_actions = []

    # Iterate through each 'Move_XX' column for the current row
    for col in train_data.columns[3:70]:
        row_actions.append(row[col])

        # Check if the current action is 't10'
        if row[col] == 't10':  # Assuming 't10' is converted to 100 in your previous processing
            break  # Stop iterating if 't10' is found

    data_10s.append(row_actions)

# Convert the result to a new dataframe if needed
data_10s_df = pd.DataFrame(data_10s, columns=move_columns)

data_10s_df.insert(0, 'PlayerID', train_data['PlayerID'])
data_10s_df.insert(1, 'Race', train_data['Race'])

data_10s_df.to_csv('data_10s.csv', index=False)

# Load the data_10s.csv file
data_10s_df = pd.read_csv('data_10s.csv')

# Get the move columns
move_columns = [f'Move_{i}' for i in range(1, 68)]

# Flatten the dataframe to have a single column of moves
all_moves = data_10s_df[move_columns].values.flatten()

# Count the frequency of each move
moves_frequency = {}
for move in all_moves:
    if pd.notna(move):  # Exclude NaN values
        moves_frequency[move] = moves_frequency.get(move, 0) + 1

print("Moves Frequency:")
for move, frequency in moves_frequency.items():
    print(f"{move}: {frequency}")

# Group the data by PlayerID and reset the index
grouped_data = data_10s_df.groupby('PlayerID')[move_columns].apply(lambda x: x.reset_index(drop=True))

# Define a function to find sequences of consecutive moves
def find_sequences(group):
    sequences = []
    for _, g in groupby(enumerate(group), key=lambda x: int(x[1] == 't10')):
        consecutive_moves = list(map(lambda x: x[1], g))

        # Remove 't5' and 't10' from consecutive_moves
        consecutive_moves = [move for move in consecutive_moves if move not in ['t5', 't10']]

        sequences.append(consecutive_moves)
    return sequences

# Iterate through each player's moves and find sequences
player_sequences = {}
for player, moves in grouped_data.iterrows():
    sequences = find_sequences(moves.dropna().astype(str))
    player_sequences[player] = sequences

# Save the found sequences to a text file
output_file_path = 'sequences.txt'
with open(output_file_path, 'w') as file:
    for player, sequences in player_sequences.items():
        file.write(f"{player} : \t")
        for sequence in sequences:
            file.write(','.join(sequence) + '\n')
        file.write("\n")

print(f"Sequences saved to {output_file_path}")

# Define a function to evaluate the combination of moves
def evaluate_combination(combination):
    unique_moves = set(combination[0] + combination[1])
    uniqueness_score = sum(moves_frequency.get(move, 0) for move in unique_moves)
    return uniqueness_score

# Iterate through each player's sequences and find ranked combinations
ranked_combinations = {}
for player, sequences in player_sequences.items():
    combinations_list = list(combinations(sequences, 2))
    ranked_combinations[player] = sorted(combinations_list, key=lambda x: evaluate_combination(x), reverse=True)

# Print or save the ranked combinations
for player, combinations in ranked_combinations.items():
    print(f"Player {player} ranked combinations:")
    for i, combination in enumerate(combinations, start=1):
        print(f"Rank {i}: {combination} - Score: {evaluate_combination(combination)}")

# Save the ranked combinations to a text file
output_file_path = 'ranked_combinations.txt'
with open(output_file_path, 'w') as file:
    for player, combinations in ranked_combinations.items():
        file.write(f"Player {player} ranked combinations:\n")
        for i, combination in enumerate(combinations, start=1):
            file.write(f"Rank {i}: {combination} - Score: {evaluate_combination(combination)}\n")
        file.write("\n")

print(f"Ranked combinations saved to {output_file_path}")

Moves Frequency:
hotkey11: 20
hotkey21: 41
hotkey31: 20
hotkey41: 54
hotkey51: 23
hotkey61: 43
s: 25854
SingleMineral: 540
t5: 3041
hotkey12: 5676
hotkey22: 4988
t10: 3039
hotkey30: 1529
hotkey71: 27
hotkey32: 3827
hotkey20: 1957
hotkey40: 1336
hotkey10: 1844
hotkey90: 643
hotkey70: 488
hotkey80: 488
hotkey42: 4564
hotkey00: 881
hotkey50: 986
hotkey52: 2903
Base: 1920
hotkey81: 19
hotkey91: 56
hotkey01: 81
hotkey60: 654
hotkey62: 1120
hotkey92: 261
hotkey82: 94
hotkey02: 503
hotkey72: 67


NameError: name 'groupby' is not defined

At this point we stopped as we realised that we should create a training model that find the best sequence of actions for each player based on the ranking and then combine this model with others, as RandomForest, in order to predict in the most accurate way the player based also on the total time of the race and on other features.

## Choosing a model

Choosing the right model is crucial, because it directly influences the accuracy of the predictions and therefor impacts the overall success and reliability of our system. We will evaluate the different models we tried by using the F1 accuracy and cross-validation.

1. TreeClassifier
2. Random Forest
3. Random Forest + GridSearch
3. + Feature Importance [Link Text](/Feature Importance.ipynb)
4. + AdaBoost
5. + Voting / Stacking Classifier

In [7]:
# Target
labels = train_data_new['PlayerID']

# Keep only the columns we need as features
features = train_data_new.drop(['PlayerID'], axis=1)

# Split the data into training and testing sets
X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.2, random_state=42)

We started by using the TreeClassifier. This gave us a training accuracy of 76%.

In [8]:
# Choose Decision Tree as a model and train it
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions on the val set
predictions = model.predict(X_val)

# Evaluation of model
accuracy = accuracy_score(y_val, predictions)
print(f'Accuracy: {accuracy}')

f1_DT = f1_score(y_val, predictions, average='micro')
print(f'F1 Score on Validation Set: {f1_DT}')

scores = cross_val_score(model, features, labels, cv=4)
print(f'Cross Validation Scores: {scores}')

Accuracy: 0.7610474631751227
F1 Score on Validation Set: 0.7610474631751227
Cross Validation Scores: [0.7051114  0.77588467 0.78505898 0.72608126]


With the RandomForest we were able to increase the training accuracy to 92%

In [9]:
# Choose Random Forest as a model and train it
model = RandomForestClassifier(random_state=42, n_estimators=200)
model.fit(X_train, y_train)

# Predictions on the val set
predictions = model.predict(X_val)

# Evaluation of model
accuracy = accuracy_score(y_val, predictions)
print(f'Accuracy: {accuracy}')

f1_DT = f1_score(y_val, predictions, average='micro')
print(f'F1 Score on Validation Set: {f1_DT}')

scores = cross_val_score(model, features, labels, cv=4)
print(f'Cross Validation Scores: {scores}')

Accuracy: 0.9263502454991817
F1 Score on Validation Set: 0.9263502454991817
Cross Validation Scores: [0.90432503 0.92529489 0.92529489 0.92267366]


For the following changes we were not able to increase our training accuracy. In return they improved our testing accuracy on kaggle and therefore also our model. Firstly we performed hyperparameter tuning using GridSearchCV on the RandomForest.

In [10]:
# Choose a model and train it
model = RandomForestClassifier(random_state=42, n_estimators=200)

# Hyperparameter tuning using GridSearchCV
param_grid = {'n_estimators': [100, 150, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(model, param_grid, cv=4)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Use the best model for predictions
predictions = best_model.predict(X_val)

# Evaluation of model
accuracy = accuracy_score(y_val, predictions)
print(f'Accuracy: {accuracy}')

f1_DT = f1_score(y_val, predictions, average='micro')
print(f'F1 Score on Validation Set: {f1_DT}')

scores = cross_val_score(best_model, features, labels, cv=4)
print(f'Cross Validation Scores: {scores}')

Accuracy: 0.9263502454991817
F1 Score on Validation Set: 0.9263502454991817
Cross Validation Scores: [0.90432503 0.92529489 0.92529489 0.92267366]


Then we tested our model and plotted which features contributed the most to the prediction. 
(In another notebook) We did this with with the help of RandomForest feature importace we filtered out 

In [None]:
# Remove 30% of least important features
columns_to_remove = ['hk1_t5_Frequency', 'race_Zerg', 'hk9_t60_Frequency', 'hk5_t5_Frequency', 'hk7_t60_Frequency', 'hk7_t550_Frequency', 'hk9_t340_Frequency', 'hk0_t20_Frequency', 'hk6_t20_Frequency', 'base_t20_Frequency', 'hk8_t200_Frequency', 'hk7_t270_Frequency', 'hk9_t20_Frequency', 'base_t5_Frequency', 'hk9_t200_Frequency', 'hk7_t340_Frequency', 'singleMineral_t550_Frequency', 'singleMineral_t200_Frequency', 'singleMineral_t340_Frequency', 'hk9_t270_Frequency', 'sFrequency', 'hk8_t100_Frequency', 'hk0_t5_Frequency', 'race_Terran', 'singleMineralFrequency', 'hk7_t20_Frequency', 'singleMineral_t270_Frequency', 'singleMineral_t100_Frequency', 'hk8_t60_Frequency', 'hk8_t20_Frequency', 'singleMineral_t60_Frequency', 'hk6_t5_Frequency', 'hk7_t5_Frequency', 'hk8_t5_Frequency', 'singleMineral_t20_Frequency', 'singleMineral_t5_Frequency']
# Remove columns from DataFrame
train_data_new = train_data_new.drop(columns=columns_to_remove)

# Saving them in a csv file
train_data_new.to_csv('actiontype_count.csv', index=False)

train_data_new.head()

# Target
labels = train_data_new['PlayerID']

# Keep only the columns we need as features
features = train_data_new.drop(['PlayerID'], axis=1)

# Split the data into training and testing sets
X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.2, random_state=42)

# Choose a model and train it
model = RandomForestClassifier(random_state=42, n_estimators=200)

# Hyperparameter tuning using GridSearchCV
param_grid = {'n_estimators': [100, 150, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(model, param_grid, cv=4)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Use the best model for predictions
predictions = best_model.predict(X_val)

# Evaluation of model
accuracy = accuracy_score(y_val, predictions)
print(f'Accuracy: {accuracy}')

f1_DT = f1_score(y_val, predictions, average='micro')
print(f'F1 Score on Validation Set: {f1_DT}')

scores = cross_val_score(best_model, features, labels, cv=4)
print(f'Cross Validation Scores: {scores}')

## Testing

When given the test data we were looking at almost the same pattern as the training data: Each row was a gameplay with a race and the moves made. Just the identifaction of the player was missing. That one was for us to predict and to do so we used our pretrained model. 

### Precossesing

But before, we had to change the data in the same way we did with the training data. We did this, so they would be able to compare.

### Predicting 

The pre-trained model was then loaded and used to make predictions about the players.

- Same preprocessing as training data to make it comparable
- used pretrained model

Afterwards we added some statistics to asses the testing performance. We wanted to see which predictions were made: ...

And how many different players were predicted. We increased this score over time, because it meant our model was able differentiate better between players.

- how many different players were predicted. we want a higher score because it can differentiate 

At the end we merged the predicted player IDs with the first row of the training data to obtain the corresponding player URLs.
- Merge the predicted ID to get the url of the player

## Results

With our model we managed to reach a level of 82% testing accuracy and 92% training accuracy. 

- training and testing accuracy usually are not the same as in the testing part we try to predict our target from a completely new dataset, which proposes also new features.
- 

### Improvements
We could make some improvements to outr code by doing:
- Feature Engineering:
    We should consider additional features or transformations that might better capture the characteristics of our data and experiment with different aggregations, scaling, or encoding techniques.
    An other thing to do could be to check for highly correlated features and trying top remove the correlated features to reduce multicollinearity.
- Model Selection:
    A part from the model already used, we could experiment with different machine learning models, that combines the results of other models, in order to obtain better results. Some examples could be Gradient Boosting, Support Vector Machines, or Neural Networks, which may capture complex patterns in the data.
- Feature Importance:
    Another improvement could be analyze the feature importances provided by the model, so to understand which features are contributing the most to the predictions, and use this data to clean our dataset and use the features with highest importance