In this notebook I move to a more production-oriented setup. All data transformations are wrapped in a single scikit-learn pipeline, so the test data goes through the same preprocessing steps, in the same order, as the training data. Keeping the preprocessing in one object makes the code easier to reuse and less error-prone when the saved model is applied to new data.
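
As a minimal sketch of what "the same preprocessing" can mean for a stateful step, the toy example below fits a pipeline on one frame and reuses the learned state on another. The `FrequencyBinner` class, its `threshold` parameter, and the toy data are hypothetical and are not part of this notebook's pipeline; they only illustrate the fit-then-transform pattern.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class FrequencyBinner(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: bins a column by its frequency in the fitted data."""

    def __init__(self, column, threshold=2):
        self.column = column
        self.threshold = threshold

    def fit(self, X, y=None):
        # Counts are learned once, from whatever data is passed to fit()
        self.counts_ = X[self.column].value_counts()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        # Values never seen during fit() get a count of 0
        counts = X[self.column].map(self.counts_).fillna(0)
        labels = counts.ge(self.threshold).map({True: "Frequent", False: "Rare"})
        X[self.column + "_binned"] = labels
        return X


# Toy stand-ins for the training and test sets (made-up values)
train = pd.DataFrame({"registration_country": ["France", "France", "Brazil", "France"]})
test = pd.DataFrame({"registration_country": ["France", "Brazil", "Spain"]})

toy_pipeline = Pipeline([("country_binning", FrequencyBinner("registration_country"))])
toy_pipeline.fit(train)                      # learn the counts from the training data
test_binned = toy_pipeline.transform(test)   # reuse them on the test data
print(test_binned)
```

In this sketch, "Spain" never appears in the fitted data and therefore lands in the "Rare" bin instead of raising an error.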

In [1]:
import numpy as np
import pandas as pd
from joblib import load
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
import warnings
warnings.filterwarnings('ignore')

In [2]:
base_dir = "C:\\Users\\filip.ilic\\Desktop\\NORD\\"
model_path = base_dir + "best_model.joblib"
test_data_path = base_dir + "jobfair_test.csv"

best_model = load(model_path)
df_test = pd.read_csv(test_data_path)

df_test

Unnamed: 0,season,club_id,league_id,dynamic_payment_segment,cohort_season,avg_age_top_11_players,avg_stars_top_11_players,avg_stars_top_14_players,avg_training_factor_top_11_players,days_active_last_28_days,...,playtime_last_28_days,registration_country,registration_platform_specific,league_match_won_count_last_28_days,training_count_last_28_days,global_competition_level,tokens_spent_last_28_days,tokens_stash,rests_stash,morale_boosters_stash
0,174,14542747,2951383,0) NonPayer,1,23,4.295345,4.100333,0.469406,8,...,2580346,France,iOS Phone,10,6,,82,18,177,106
1,174,11019672,2954266,0) NonPayer,21,24,4.704727,4.484933,0.317702,28,...,15521681,Indonesia,Android Phone,16,26,7.0,153,65,1030,717
2,174,14358567,2951259,0) NonPayer,2,22,2.923867,2.819171,0.669540,0,...,0,Brazil,Android Phone,7,0,,0,138,156,98
3,174,14644461,2949546,0) NonPayer,1,22,3.114776,2.977457,0.639923,3,...,1136208,Spain,Android Phone,2,1,,2,43,43,25
4,174,13718978,2952772,0) NonPayer,6,23,4.194497,4.114257,0.486229,26,...,43655679,France,Android Phone,22,46,1.0,100,30,45,53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60265,174,14614619,2950335,3) Dolphin,1,22,5.717321,5.430162,0.455917,28,...,264650737,Germany,iOS Phone,25,183,5.0,346,50,25,170
60266,174,13805591,2955052,0) NonPayer,5,21,4.079200,3.851924,0.645815,1,...,543672,Indonesia,Android Phone,12,1,2.0,0,12,188,148
60267,174,14559860,2953561,0) NonPayer,1,23,4.839515,4.643467,0.459347,20,...,19993567,Brazil,Android Tablet,16,40,2.0,145,3,9,173
60268,174,14671631,2950158,0) NonPayer,1,26,4.882400,4.596143,0.402482,8,...,49796851,Vietnam,WebGL TE Site,6,69,2.0,109,2,57,84


# Data Preprocessing

In [3]:
class FillNaTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        X['global_competition_level'] = X['global_competition_level'].fillna(0)
        return X

class CountryBinningTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, high_threshold=2000, low_threshold=200):
        self.high_threshold = high_threshold
        self.low_threshold = low_threshold

    def fit(self, X, y=None):
        # Learn how often each country occurs in the data passed to fit()
        self.country_counts_ = X['registration_country'].value_counts()
        return self

    def transform(self, X):
        X['registration_country_binned'] = X['registration_country'].apply(self.bin_countries)
        return X

    def bin_countries(self, country):
        # Countries not seen during fit() count as 0 and fall into 'Other'
        # instead of raising a KeyError
        count = self.country_counts_.get(country, 0)
        if count > self.high_threshold:
            return 'High Frequency'
        elif count > self.low_threshold:
            return 'Medium Frequency'
        elif count > 1:
            return 'Low Frequency'
        else:
            return 'Other'

class PlatformBinningTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['registration_platform_binned'] = X['registration_platform_specific'].apply(self.bin_platform)
        return X

    @staticmethod
    def bin_platform(platform):
        if 'Phone' in platform:
            return 'Phone'
        elif 'Tablet' in platform:
            return 'Tablet'
        elif 'PC' in platform or 'Flash TE Site' in platform:
            return 'PC'
        else:
            return 'Web'
        
class DropColumnsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(self.columns, axis=1)
        
        
class GroupedAggregationTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, groupby_column):
        self.groupby_column = groupby_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # List all transformations explicitly
        transformations = [
            ('avg_age_top_11_players', 'avg_age_within_league', 'std_age_within_league'),
            ('avg_stars_top_11_players', 'avg_stars_within_league', 'std_avg_stars_within_league'),
            ('avg_training_factor_top_11_players', 'avg_training_factor_within_league', 'std_training_factor_within_league'),
            ('days_active_last_28_days', 'avg_activeday_within_league', 'std_activeday_within_league'),
            ('league_match_watched_count_last_28_days', 'avg_matchwatched_within_league', 'std_matchwatched_within_league'),
            ('session_count_last_28_days', 'avg_session_within_league', 'std_session_within_league'),
            ('playtime_last_28_days', 'avg_playtime_within_league', 'std_playtime_within_league'),
            ('league_match_won_count_last_28_days', 'avg_won_within_league', 'std_won_within_league'),
            ('training_count_last_28_days', 'avg_training_within_league', 'std_training_within_league'),
            ('tokens_spent_last_28_days', 'avg_tokenspent_within_league', 'std_tokenspent_within_league'),
            ('tokens_stash', 'avg_tokenstash_within_league', 'std_tokenstash_within_league'),
            ('rests_stash', 'avg_restsstash_within_league', 'std_restsstash_within_league'),
            ('morale_boosters_stash', 'avg_mbstash_within_league', 'std_mbstash_within_league')
        ]

        for orig_col, mean_col, std_col in transformations:
            X[mean_col] = X.groupby(self.groupby_column)[orig_col].transform('mean')
            X[std_col] = X.groupby(self.groupby_column)[orig_col].transform('std')

        return X  

    
class ExtractFirstCharTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X[self.column] = X[self.column].str[0]
        return X
    
    
class OneHotEncodingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Create the encoders in fit() so that __init__ only stores parameters,
        # as scikit-learn expects, and refitting starts from a clean state
        self.encoders_ = {}
        for col in self.columns:
            self.encoders_[col] = OneHotEncoder().fit(X[[col]])
        return self

    def transform(self, X):
        for col in self.columns:
            encoder = self.encoders_[col]
            transformed = encoder.transform(X[[col]]).toarray()
            # Name each new column '<original column>_<category>'
            columns = [f'{col}_{category}' for category in encoder.categories_[0]]
            X = X.drop(col, axis=1)
            X[columns] = transformed
        return X


In [4]:
pipeline = Pipeline([
    ('fill_na', FillNaTransformer()),
    ('country_binning', CountryBinningTransformer()),
    ('platform_binning', PlatformBinningTransformer()),
    ('drop_columns', DropColumnsTransformer(['avg_stars_top_14_players', 'season', 'registration_country', 'registration_platform_specific'])),
    ('extract_first_char', ExtractFirstCharTransformer('dynamic_payment_segment')),
    ('grouped_aggregation', GroupedAggregationTransformer('league_id')),
    ('one_hot_encoding', OneHotEncodingTransformer(['dynamic_payment_segment', 'registration_country_binned', 'registration_platform_binned']))
])

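# Note: the pipeline is fitted on the test set itself, so the country frequency bins,
# league-level aggregates and one-hot categories are all derived from the test data.
df_test_transformed = pipeline.fit_transform(df_test)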

In [5]:
identifiers = df_test_transformed[['league_id', 'club_id']]
feature_names = best_model.feature_names_in_
df_test_features = df_test_transformed[feature_names]

predicted_league_ranks = best_model.predict(df_test_features)

submission = pd.DataFrame({
    'league_id': identifiers['league_id'],
    'club_id': identifiers['club_id'],
    'predicted_league_rank': predicted_league_ranks
})

submission['assigned_rank'] = submission.groupby('league_id')['predicted_league_rank'].rank(method='first', ascending=True)
submission.sort_values(by=['league_id']).head(14)

Unnamed: 0,league_id,club_id,predicted_league_rank,assigned_rank
13220,2949216,14641979,4.4,5.0
18150,2949216,14632000,11.8,13.0
18137,2949216,14644725,5.21,6.0
13263,2949216,14514021,11.15,10.0
55359,2949216,14679772,8.51,7.0
13265,2949216,14688049,11.48,12.0
18126,2949216,14646103,4.13,4.0
55509,2949216,14643992,12.07,14.0
55369,2949216,14625391,11.24,11.0
13311,2949216,14627610,9.22,8.0


In this final step I convert the model's predictions into league rankings. The model outputs a continuous score for each club (for example 4.4 or 11.8), whereas the task requires a distinct integer position within each league. I therefore rank the clubs within each league by predicted score in ascending order, breaking ties by order of appearance (`method='first'`), so that every club receives a unique rank. The output then matches the structure of the game, in which each club occupies exactly one position in its league table.
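
The ranking step can be illustrated on a small made-up frame. The league and club IDs and the scores below are invented for the example; only the `groupby(...).rank(method='first')` call mirrors the cell above.

```python
import pandas as pd

# Toy predictions: two leagues, with a tie (4.4) inside league 1
toy = pd.DataFrame({
    "league_id": [1, 1, 1, 2, 2],
    "club_id": [10, 11, 12, 20, 21],
    "predicted_league_rank": [4.4, 2.1, 4.4, 7.0, 3.5],
})

# Rank within each league; method='first' breaks ties by row order,
# so every club gets a distinct rank from 1 to the league size
toy["assigned_rank"] = (
    toy.groupby("league_id")["predicted_league_rank"]
    .rank(method="first", ascending=True)
    .astype(int)
)
print(toy)
```

The two clubs tied at 4.4 in league 1 receive ranks 2 and 3 in the order in which they appear, so no rank is repeated or skipped within a league.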


In [6]:
file_path = base_dir + "league_rank_predictions.csv"

final_submission = submission[['club_id', 'assigned_rank']].rename(columns={'assigned_rank': 'predicted_league_rank'})
final_submission['predicted_league_rank'] = final_submission['predicted_league_rank'].astype(int)

final_submission.to_csv(file_path, index=False)