In this competition, you may only use one of the following options: Decision Trees, Rule-based Models, k-Nearest Neighbors, or a Naïve Bayes Classifier. Otherwise, you will receive a score of 0.

You may use any deep learning library (PyTorch, TensorFlow, etc.).

Only individual participation is allowed. You must complete this competition by yourself.

You must submit your training code to the PLMS. TAs may check whether your code reproduces your results. Any significant difference in the reproduced results will incur a severe penalty.

If anything that violates the honor code is found, the TAs will contact you. If you cannot give a reasonable explanation, you will receive a severe penalty.

**Evaluation criterion: Weighted F1 score**

**In the classification task, your goal is to predict the player's position (the `position` column in the CSV file).**
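
To make the evaluation criterion concrete, here is a minimal pure-Python sketch of how a weighted F1 score is computed: each class gets its own F1, and the per-class scores are averaged with weights proportional to how often each class appears in the true labels. The function name `weighted_f1` and the position labels `"G"`, `"F"`, `"C"` are illustrative assumptions, not values taken from the competition data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and predicted count for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Hypothetical labels, for illustration only
y_true = ["G", "G", "G", "F", "F", "C"]
y_pred = ["G", "G", "F", "F", "F", "C"]
print(weighted_f1(y_true, y_pred))
```

In the notebook itself, the same quantity comes from `f1_score(..., average='weighted')` and from `scoring='f1_weighted'` in the grid search.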

### Import modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score  # needed for the validation scores below


### Define and preprocess the dataset

In [None]:
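# Load the data and drop training rows with missing values
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.dropna(inplace=True)

# Separate the features from the target; the ID-like columns are not used as features
X = train_df.drop(['position', 'SEASON_ID', 'TEAM_ID'], axis=1)
y = train_df['position']

# Encode the position labels as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to the validation split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)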
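
The scaling step above fits on the training split and reuses those statistics for the validation split. The following pure-Python sketch shows what that means for a single feature column, assuming standardization with the population standard deviation (which is what `StandardScaler` uses). The helper names `fit_scaler` and `transform` and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns to avoid dividing by zero
    return mean, (std if std > 0 else 1.0)


def transform(column, mean, std):
    """Standardize a column using statistics computed elsewhere."""
    return [(x - mean) / std for x in column]


# Hypothetical feature values, for illustration only
train_col = [190.0, 200.0, 210.0]
val_col = [205.0]

mean, std = fit_scaler(train_col)             # statistics come from the training split only
train_scaled = transform(train_col, mean, std)
val_scaled = transform(val_col, mean, std)    # validation reuses the training statistics
print(train_scaled, val_scaled)
```

Reusing the training statistics keeps information from the validation and test rows out of the preprocessing step.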

In [78]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
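
As a rough illustration of why a stratified splitter is used here, the sketch below assigns samples to folds class by class so that each fold keeps approximately the same class proportions as the full dataset. This is a simplified stand-in written for explanation, not scikit-learn's actual algorithm; the function name `stratified_folds` and the labels are assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Distribute sample indices over folds, round-robin within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before dealing indices out
        for i, idx in enumerate(indices):
            folds[i % n_splits].append(idx)
    return folds


# Hypothetical imbalanced labels, for illustration only
labels = ["G"] * 6 + ["F"] * 3
folds = stratified_folds(labels, n_splits=3)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_counts)
```

With imbalanced position classes, this keeps every validation fold representative, which matters for a support-weighted metric such as weighted F1.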

### Decision Tree

In [79]:
dt = DecisionTreeClassifier(random_state=42)
param_grid_dt = {
    'max_depth': [None, 3, 5, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10, 15, 20]
}
grid_dt = GridSearchCV(dt, param_grid_dt, scoring='f1_weighted', cv=skf)
grid_dt.fit(X_train, y_train)

print(f'Decision Tree Best parameters: {grid_dt.best_params_}, Best Score: {grid_dt.best_score_}')

best_dt_model = grid_dt.best_estimator_
y_val_pred = best_dt_model.predict(X_val)

val_f1_score = f1_score(y_val, y_val_pred, average='weighted')
print(f'Validation Weighted F1 Score: {val_f1_score}')

Decision Tree Best parameters: {'max_depth': 10, 'min_samples_split': 20}, Best Score: 0.5622818480878923
Validation Weighted F1 Score: 0.5446772148185797


### kNN

In [80]:
knn = KNeighborsClassifier()
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 10, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
grid_knn = GridSearchCV(knn, param_grid_knn, scoring='f1_weighted', cv=skf)
grid_knn.fit(X_train, y_train)

print(f'K-Nearest Neighbors Best parameters: {grid_knn.best_params_}, Best Score: {grid_knn.best_score_}')

best_knn_model = grid_knn.best_estimator_
y_val_pred = best_knn_model.predict(X_val)

val_f1_score = f1_score(y_val, y_val_pred, average='weighted')
print(f'Validation Weighted F1 Score: {val_f1_score}')

K-Nearest Neighbors Best parameters: {'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'}, Best Score: 0.5555266755819558
Validation Weighted F1 Score: 0.562098605078817


### Predict test data

In [81]:
X_test = test_df.drop(['ID', 'SEASON_ID', 'TEAM_ID'], axis=1)
X_test = X_test.reindex(columns=X.columns, fill_value=0)
X_test = scaler.transform(X_test)

y_test_pred = best_dt_model.predict(X_test)
y_test_pred_decoded = label_encoder.inverse_transform(y_test_pred)
submission_df = pd.DataFrame({'ID': test_df['ID'], 'position': y_test_pred_decoded})
submission_file_path = 'submission_DT.csv'
submission_df.to_csv(submission_file_path, index=False)
print("DT Done.")

y_test_pred = best_knn_model.predict(X_test)
y_test_pred_decoded = label_encoder.inverse_transform(y_test_pred)
submission_df = pd.DataFrame({'ID': test_df['ID'], 'position': y_test_pred_decoded})
submission_file_path = 'submission_kNN.csv'
submission_df.to_csv(submission_file_path, index=False)
print("kNN Done.")

DT Done.
kNN Done.
