## Simple, easy and fast and less overfitting solution with 460 features

This notebook shows problem solving approach using LightGBM Regression and 890 features computed by bruno aquino in the following notebook which are later reduced to 460 features in my approach.

https://www.kaggle.com/braquino/890-features

It also uses the regression coefficients from following notebook by artgor.

https://www.kaggle.com/artgor/quick-and-dirty-regression

Apart from these i also have included resultant LightGBM parameters from exhaustive parameter tuning.

If you find this notebook helpful please press that thumbs up button and thank you :)

PLEASE NOTE THIS IMPORTANT POINT "**DON'T BELIEVE IN PUBLIC LB**" IT'S ONLY 14% of real data that's private!! We should build a model that's less overfittig and still finding the good results."

Your score will be different for different submissions that's because of randomness in gradient boosting!
and that's completely normal you must focus on reducing overfitting, gather as much data as possible and ofcourse reduce the number of features as much as possible without sacrificing model validation score and that's exactly what i've done below :)

## Imports

In [1]:
#####SOME BREAD BUTTER JAM IMPORTS!##
import numpy as np
import pandas as pd
import random
import os
import gc
import warnings
import re
#AND SOME USEFULL IMPORTS
import copy #deep copy porpuses
import json #reading and manipulation of data
from itertools import product #make some combinations, used for feature extraction and combinations
from collections import Counter #counting the occurances, here used in feature extractions
from tqdm import tqdm_notebook #for fancy looking progress
from tqdm import tqdm #for fancy looking progress
import datetime #for time related tasks
import time #for time related tasks
import lightgbm as lgb #star of the show
from sklearn.preprocessing import LabelEncoder #Label encoding of categorical variables
from sklearn.model_selection import GroupKFold #CV purposes
from sklearn.metrics import classification_report, confusion_matrix #For helping with my favourite metric QWK
from sklearn import metrics #self explanatory!
from bayes_opt import BayesianOptimization

**Some necessary settings**

Warnings and other stuff you know!

In [2]:
warnings.filterwarnings("ignore") #ignore all warnings we don't care!!
pd.set_option('max_rows', 500) #for explanatory purposes
pd.options.display.precision = 15 #set default display precision

**Helper functions read, encode and make features from each and every installation data one by one.**

In [3]:
def read_data():
    print('Reading train.csv file....')
    train = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv')
    print('Training.csv file have {} rows and {} columns'.format(train.shape[0], train.shape[1]))

    print('Reading test.csv file....')
    test = pd.read_csv('/kaggle/input/data-science-bowl-2019/test.csv')
    print('Test.csv file have {} rows and {} columns'.format(test.shape[0], test.shape[1]))

    print('Reading train_labels.csv file....')
    train_labels = pd.read_csv('/kaggle/input/data-science-bowl-2019/train_labels.csv')
    print('Train_labels.csv file have {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))

    print('Reading specs.csv file....')
    specs = pd.read_csv('/kaggle/input/data-science-bowl-2019/specs.csv')
    print('Specs.csv file have {} rows and {} columns'.format(specs.shape[0], specs.shape[1]))

    print('Reading sample_submission.csv file....')
    sample_submission = pd.read_csv('/kaggle/input/data-science-bowl-2019/sample_submission.csv')
    print('Sample_submission.csv file have {} rows and {} columns'.format(sample_submission.shape[0], sample_submission.shape[1]))
    return train, test, train_labels, specs, sample_submission

def encode_title(train, test, train_labels):
    # encode title
    train['title_event_code'] = list(map(lambda x, y: str(x) + '_' + str(y), train['title'], train['event_code']))
    test['title_event_code'] = list(map(lambda x, y: str(x) + '_' + str(y), test['title'], test['event_code']))
    all_title_event_code = list(set(train["title_event_code"].unique()).union(test["title_event_code"].unique()))
    # make a list with all the unique 'titles' from the train and test set
    list_of_user_activities = list(set(train['title'].unique()).union(set(test['title'].unique())))
    # make a list with all the unique 'event_code' from the train and test set
    list_of_event_code = list(set(train['event_code'].unique()).union(set(test['event_code'].unique())))
    list_of_event_id = list(set(train['event_id'].unique()).union(set(test['event_id'].unique())))
    # make a list with all the unique worlds from the train and test set
    list_of_worlds = list(set(train['world'].unique()).union(set(test['world'].unique())))
    # create a dictionary numerating the titles
    activities_map = dict(zip(list_of_user_activities, np.arange(len(list_of_user_activities))))
    activities_labels = dict(zip(np.arange(len(list_of_user_activities)), list_of_user_activities))
    activities_world = dict(zip(list_of_worlds, np.arange(len(list_of_worlds))))
    assess_titles = list(set(train[train['type'] == 'Assessment']['title'].value_counts().index).union(set(test[test['type'] == 'Assessment']['title'].value_counts().index)))
    # replace the text titles with the number titles from the dict
    train['title'] = train['title'].map(activities_map)
    test['title'] = test['title'].map(activities_map)
    train['world'] = train['world'].map(activities_world)
    test['world'] = test['world'].map(activities_world)
    train_labels['title'] = train_labels['title'].map(activities_map)
    win_code = dict(zip(activities_map.values(), (4100*np.ones(len(activities_map))).astype('int')))
    # then, it set one element, the 'Bird Measurer (Assessment)' as 4110, 10 more than the rest
    win_code[activities_map['Bird Measurer (Assessment)']] = 4110
    # convert text into datetime
    train['timestamp'] = pd.to_datetime(train['timestamp'])
    test['timestamp'] = pd.to_datetime(test['timestamp'])
    return train, test, train_labels, win_code, list_of_user_activities, list_of_event_code, activities_labels, assess_titles, list_of_event_id, all_title_event_code

def get_data(user_sample, test_set=False):
    '''
    The user_sample is a DataFrame from train or test where the only one 
    installation_id is filtered
    And the test_set parameter is related with the labels processing, that is only requered
    if test_set=False
    '''
    # Constants and parameters declaration
    last_activity = 0
    
    user_activities_count = {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
    
    # new features: time spent in each activity
    last_session_time_sec = 0
    accuracy_groups = {0:0, 1:0, 2:0, 3:0}
    all_assessments = []
    accumulated_accuracy_group = 0
    accumulated_accuracy = 0
    accumulated_correct_attempts = 0 
    accumulated_uncorrect_attempts = 0
    accumulated_actions = 0 
    counter = 0
    time_first_activity = float(user_sample['timestamp'].values[0])
    durations = []
    last_accuracy_title = {'acc_' + title: -1 for title in assess_titles}
    event_code_count = {ev: 0 for ev in list_of_event_code}
    event_id_count = {eve: 0 for eve in list_of_event_id}
    title_count = {eve: 0 for eve in activities_labels.values()} 
    title_event_code_count = {t_eve: 0 for t_eve in all_title_event_code}
    time_spent_each_act = {t+"_time": 0 for t in titles}
        
    # itarates through each session of one instalation_id
    for i, session in user_sample.groupby('game_session', sort=False):
        # i = game_session_id
        # session is a DataFrame that contain only one game_session
        
        # get some sessions information
        session_type = session['type'].iloc[0]
        session_title = session['title'].iloc[0]
        session_title_text = activities_labels[session_title]
        
        if (session_type != 'Assessment'):
            time_spent = int(session["game_time"].iloc[-1] / 1000)
            time_spent_each_act[inverse_transform[session_title] + "_time"] += time_spent
        
        # for each assessment, and only this kind off session, the features below are processed
        # and a register are generated
        if (session_type == 'Assessment') & (test_set or len(session)>1):
            # search for event_code 4100, that represents the assessments trial
            all_attempts = session.query('event_code == {}'.format(win_code[session_title]))
            # then, check the numbers of wins and the number of losses
            true_attempts = all_attempts['event_data'].str.contains('true').sum()
            false_attempts = all_attempts['event_data'].str.contains('false').sum()
            # copy a dict to use as feature template, it's initialized with some itens: 
            # {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
            features = user_activities_count.copy()
            features.update(last_accuracy_title.copy())
            features.update(event_code_count.copy())
            features.update(event_id_count.copy())
            features.update(title_count.copy())
            features.update(title_event_code_count.copy())
            features.update(last_accuracy_title.copy())
            features.update(time_spent_each_act.copy())
            
            # get installation_id for aggregated features
            features['installation_id'] = session['installation_id'].iloc[-1]
            # add title as feature, remembering that title represents the name of the game
            features['session_title'] = session['title'].iloc[0]
            # the 4 lines below add the feature of the history of the trials of this player
            # this is based on the all time attempts so far, at the moment of this assessment
            features['accumulated_correct_attempts'] = accumulated_correct_attempts
            features['accumulated_uncorrect_attempts'] = accumulated_uncorrect_attempts
            accumulated_correct_attempts += true_attempts 
            accumulated_uncorrect_attempts += false_attempts
            # the time spent in the app so far
            if durations == []:
                features['duration_mean'] = 0
            else:
                features['duration_mean'] = np.mean(durations)
            durations.append((session.iloc[-1, 2] - session.iloc[0, 2] ).seconds)
            # the accurace is the all time wins divided by the all time attempts
            features['accumulated_accuracy'] = accumulated_accuracy/counter if counter > 0 else 0
            accuracy = true_attempts/(true_attempts+false_attempts) if (true_attempts+false_attempts) != 0 else 0
            accumulated_accuracy += accuracy
            last_accuracy_title['acc_' + session_title_text] = accuracy
            # a feature of the current accuracy categorized
            # it is a counter of how many times this player was in each accuracy group
            if accuracy == 0:
                features['accuracy_group'] = 0
            elif accuracy == 1:
                features['accuracy_group'] = 3
            elif accuracy == 0.5:
                features['accuracy_group'] = 2
            else:
                features['accuracy_group'] = 1
            features.update(accuracy_groups)
            accuracy_groups[features['accuracy_group']] += 1
            # mean of the all accuracy groups of this player
            features['accumulated_accuracy_group'] = accumulated_accuracy_group/counter if counter > 0 else 0
            accumulated_accuracy_group += features['accuracy_group']
            # how many actions the player has done so far, it is initialized as 0 and updated some lines below
            features['accumulated_actions'] = accumulated_actions
            
            # there are some conditions to allow this features to be inserted in the datasets
            # if it's a test set, all sessions belong to the final dataset
            # it it's a train, needs to be passed throught this clausule: session.query(f'event_code == {win_code[session_title]}')
            # that means, must exist an event_code 4100 or 4110
            if test_set:
                all_assessments.append(features)
            elif true_attempts+false_attempts > 0:
                all_assessments.append(features)
                
            counter += 1
        
        # this piece counts how many actions was made in each event_code so far
        def update_counters(counter: dict, col: str):
                num_of_session_count = Counter(session[col])
                for k in num_of_session_count.keys():
                    x = k
                    if col == 'title':
                        x = activities_labels[k]
                    counter[x] += num_of_session_count[k]
                return counter
            
        event_code_count = update_counters(event_code_count, "event_code")
        event_id_count = update_counters(event_id_count, "event_id")
        title_count = update_counters(title_count, 'title')
        title_event_code_count = update_counters(title_event_code_count, 'title_event_code')

        # counts how many actions the player has done so far, used in the feature of the same name
        accumulated_actions += len(session)
        if last_activity != session_type:
            user_activities_count[session_type] += 1
            last_activitiy = session_type 
                        
    # if it's the test_set, only the last assessment must be predicted, the previous are scraped
    if test_set:
        return all_assessments[-1]
    return all_assessments

**Get unique titles from train set finding proper mapping for label encoding and also have inverse mapping dict that will be used to make duration related features**

In [4]:
train, test, train_labels, specs, sample_submission = read_data()
labelEncoderTitle = LabelEncoder()
labelEncoderTitle.fit(train["title"].values)
titles = labelEncoderTitle.classes_
print("total {} titles".format(len(titles)))
numbers = labelEncoderTitle.transform(titles)
inverse_transform = {}
index = 0
for number in numbers:
    inverse_transform[number] = titles[index]
    index += 1

Reading train.csv file....
Training.csv file have 11341042 rows and 11 columns
Reading test.csv file....
Test.csv file have 1156414 rows and 11 columns
Reading train_labels.csv file....
Train_labels.csv file have 17690 rows and 7 columns
Reading specs.csv file....
Specs.csv file have 386 rows and 3 columns
Reading sample_submission.csv file....
Sample_submission.csv file have 1000 rows and 2 columns
total 44 titles


# Make Train and Test set

In [5]:
def get_train_and_test(train, test):
    compiled_train = []
    compiled_test = []
    for i, (ins_id, user_sample) in tqdm(enumerate(train.groupby('installation_id', sort = False)), total = 17000):
        compiled_train += get_data(user_sample)
    for ins_id, user_sample in tqdm(test.groupby('installation_id', sort = False), total = 1000):
        test_data = get_data(user_sample, test_set = True)
        compiled_test.append(test_data)
    reduce_train = pd.DataFrame(compiled_train)
    reduce_test = pd.DataFrame(compiled_test)
    categoricals = ['session_title']
    return reduce_train, reduce_test, categoricals

# get usefull dict with maping encode
train, test, train_labels, win_code, list_of_user_activities, list_of_event_code, activities_labels, assess_titles, list_of_event_id, all_title_event_code = encode_title(train, test, train_labels)
# tranform function to get the train and test set
reduce_train, reduce_test, categoricals = get_train_and_test(train, test)

100%|██████████| 17000/17000 [07:24<00:00, 38.25it/s]
100%|██████████| 1000/1000 [00:47<00:00, 21.10it/s]


**Get some features than can provide a brief about overall installation experiance of user**

Based some important historical features which we have created using get_data function!


In [6]:
def preprocess(reduce_train, reduce_test):
    for df in [reduce_train, reduce_test]:
        df['installation_session_count'] = df.groupby(['installation_id'])['Clip'].transform('count')
        df['installation_duration_mean'] = df.groupby(['installation_id'])['duration_mean'].transform('mean')
        #df['installation_duration_std'] = df.groupby(['installation_id'])['duration_mean'].transform('std')
        df['installation_title_nunique'] = df.groupby(['installation_id'])['session_title'].transform('nunique')
        
        df['sum_event_code_count'] = df[[2050, 4100, 4230, 5000, 4235, 2060, 4110, 5010, 2070, 2075, 2080, 2081, 2083, 3110, 4010, 3120, 3121, 4020, 4021, 
                                        4022, 4025, 4030, 4031, 3010, 4035, 4040, 3020, 3021, 4045, 2000, 4050, 2010, 2020, 4070, 2025, 2030, 4080, 2035, 
                                        2040, 4090, 4220, 4095]].sum(axis = 1)
        
        df['installation_event_code_count_mean'] = df.groupby(['installation_id'])['sum_event_code_count'].transform('mean')
        #df['installation_event_code_count_std'] = df.groupby(['installation_id'])['sum_event_code_count'].transform('std')
        
    features = reduce_train.loc[(reduce_train.sum(axis=1) != 0), (reduce_train.sum(axis=0) != 0)].columns # delete useless columns
    features = [x for x in features if x not in ['accuracy_group', 'installation_id']] + ['acc_' + title for title in assess_titles]
   
    return reduce_train, reduce_test, features
# call feature engineering function
reduce_train, reduce_test, features = preprocess(reduce_train, reduce_test)

## REDUCTION BEGINS

**1. Remove features having constant value for approx. 99% of rows**

These features are known as quansi-constant features and can lead to bad results in test set. Because model will not be able to make decision for new values it haven't seen while training!

In [7]:
del_cols = []
for col in reduce_train.columns.values:
    counts = reduce_train[col].value_counts().iloc[0]
    if (counts / reduce_train.shape[0]) >= 0.99:
        del_cols.append(col)
print(str(len(del_cols)) + " features removed!")
reduce_train.drop(del_cols, inplace = True, axis = "columns")
reduce_test.drop(del_cols, inplace = True, axis = "columns")

79 features removed!


**2. Remove features having duplicate values for approx. 99% of rows**

These duplicate features can lead to bad results in test set. Because model will find it difficult to make decision for different values of features it haven't seen while training!

You can uncomment following code and execute it but it will take a lot of time to iterate over all columns and find the duplicate ones so i have included the columns which are to be deleted in next cell :)

In [8]:
"""same_features = {}
counter = 0 
for i_col in tqdm(reduce_train.columns.values, total = len(reduce_train.columns.values)):
    for j_col in reduce_train.columns.values:
        if i_col == j_col:
            continue
        if i_col in same_features:
            if j_col in same_features[i_col]:
                continue
        if j_col in same_features:
            if i_col in same_features[j_col]:
                continue
        same = False
        for col in same_features:
            if i_col in same_features[col] and j_col in same_features[col]:
                same = True
        if same:
            continue
        same_amount = np.sum((reduce_train[i_col] == reduce_train[j_col]).astype(int)) / reduce_train.shape[0]
        if same_amount >= 0.99:
            if not i_col in same_features:
                same_features[i_col] = []
            same_features[i_col].append(j_col)"""

'same_features = {}\ncounter = 0 \nfor i_col in tqdm(reduce_train.columns.values, total = len(reduce_train.columns.values)):\n    for j_col in reduce_train.columns.values:\n        if i_col == j_col:\n            continue\n        if i_col in same_features:\n            if j_col in same_features[i_col]:\n                continue\n        if j_col in same_features:\n            if i_col in same_features[j_col]:\n                continue\n        same = False\n        for col in same_features:\n            if i_col in same_features[col] and j_col in same_features[col]:\n                same = True\n        if same:\n            continue\n        same_amount = np.sum((reduce_train[i_col] == reduce_train[j_col]).astype(int)) / reduce_train.shape[0]\n        if same_amount >= 0.99:\n            if not i_col in same_features:\n                same_features[i_col] = []\n            same_features[i_col].append(j_col)'

same_features is a dict having following format:

same_features["feature_t"] = [list of all features shwoing 99% similarity to feature_t]

and so only feature_t will be kept.

In [9]:
del_cols = ['37c53127',
 'Scrub-A-Dub_2050',
 '65a38bf7',
 'Cart Balancer (Assessment)_2000',
 'Cart Balancer (Assessment)_2020',
 'Pan Balance_3121',
 '8d84fa81',
 '51102b85',
 '15eb4a7d',
 'Bird Measurer (Assessment)_3021',
 '7525289a',
 'Bird Measurer (Assessment)_3121',
 'cc5087a3',
 '4d911100',
 '25fa8af4',
 '7cf1bc53',
 'Happy Camel_4090',
 'cf82af56',
 'All Star Sorting_4030',
 '3afde5dd',
 'b012cd7f',
 'Leaf Leader_2030',
 'e79f3763',
 'ecaab346',
 'b2e5b0f1',
 'Cart Balancer (Assessment)_3121',
 'Cart Balancer (Assessment)_2010',
 'b74258a0',
 '0db6d71d',
 'Dino Drink_3120',
 '89aace00',
 'e5734469',
 'Mushroom Sorter (Assessment)_2010',
 'Cart Balancer (Assessment)_4070',
 'Scrub-A-Dub_2020',
 '3dfd4aa4',
 'Mushroom Sorter (Assessment)_2035',
 '83c6c409',
 'Happy Camel_2080',
 'Chow Time_3121',
 'Cart Balancer (Assessment)_4040',
 '3dcdda7f',
 'Air Show_3121',
 'Air Show_3021',
 '9b4001e4',
 'Chest Sorter (Assessment)_2030',
 '222660ff',
 'Chest Sorter (Assessment)_2010',
 'Bird Measurer (Assessment)_4090',
 'Flower Waterer (Activity)_4022',
 'Dino Dive_2000',
 'Welcome to Lost Lagoon!',
 'Dino Dive_3121',
 'f93fc684',
 'Tree Top City - Level 1_2000',
 '4c2ec19f',
 '12 Monkeys',
 '9ce586dd',
 'Bubble Bath_4040',
 'Cauldron Filler (Assessment)_4025',
 'b120f2ac',
 'All Star Sorting_2025',
 'c277e121',
 'd9c005dd',
 'All Star Sorting_4090',
 '53c6e11a',
 'daac11b0',
 'Chest Sorter (Assessment)_4030',
 'Dino Dive_2020',
 'Cauldron Filler (Assessment)_4070',
 '5e812b27',
 '3ddc79c3',
 '363c86c9',
 'Crystals Rule_4090',
 '5be391b5',
 '1c178d24',
 'Pan Balance_3021',
 '250513af',
 'Bird Measurer (Assessment)_4040',
 4220,
 'Bubble Bath_4220',
 '736f9581',
 '9b23e8ee',
 'Egg Dropper (Activity)_2000',
 '2dcad279',
 'Costume Box',
 '37db1c2f',
 'd3640339',
 'Chest Sorter (Assessment)_2000',
 '155f62a4',
 'Chest Sorter (Assessment)_2020',
 '1325467d',
 'Chow Time_4030',
 'c952eb01',
 4235,
 '85de926c',
 'ad148f58',
 'Bubble Bath_4235',
 'Bubble Bath_4230',
 '6d90d394',
 'Bird Measurer (Assessment)_4020',
 'Fireworks (Activity)_4030',
 '6aeafed4',
 'b80e5e84',
 '1bb5fbdb',
 '262136f4',
 'Dino Dive_2070',
 '15a43e5b',
 'Heavy, Heavier, Heaviest',
 '3ccd3f02',
 '160654fd',
 'Scrub-A-Dub_2030',
 '8d748b58',
 '2a444e03',
 'c189aaf2',
 '49ed92e9',
 'Crystals Rule_4020',
 'Bird Measurer (Assessment)_2020',
 '0d18d96c',
 'Bird Measurer (Assessment)_2000',
 'f71c4741',
 'abc5811c',
 '65abac75',
 '562cec5f',
 'Chow Time_3010',
 'a8876db3',
 '51311d7a',
 'Leaf Leader_4010',
 '30614231',
 '28520915',
 'Chow Time_2030',
 '8f094001',
 'Bird Measurer (Assessment)_2030',
 '14de4c5d',
 'Crystals Rule_3020',
 'ad2fc29c',
 'Scrub-A-Dub_2083',
 '76babcde',
 'Crystals Rule_3121',
 'Bubble Bath_2035',
 'All Star Sorting_2000',
 5010,
 'Watering Hole (Activity)_5010',
 'Pan Balance_4090',
 'Bubble Bath_3121',
 'Sandcastle Builder (Activity)_4020',
 'Pan Balance_3010',
 'Chow Time_3110',
 '3bb91ced',
 'Dino Drink_2075',
 'Chicken Balancer (Activity)_4070',
 'Dino Drink_3010',
 'Crystals Rule_4050',
 4050,
 '47efca07',
 'Leaf Leader_2060',
 'Bird Measurer (Assessment)_3010',
 '6f4bd64e',
 'Scrub-A-Dub_3021',
 'de26c3a6',
 'd2278a3b',
 'Chow Time_4095',
 'Happy Camel_3121',
 'Bubble Bath_3010',
 'd2e9262e',
 'Flower Waterer (Activity)_4030',
 'Treasure Map_2000',
 'd3268efa',
 'Mushroom Sorter (Assessment)_3120',
 'Cauldron Filler (Assessment)_3120',
 'Scrub-A-Dub_3120',
 'Mushroom Sorter (Assessment)_4020',
 'Cart Balancer (Assessment)_3110',
 'Bubble Bath_2025',
 'c54cf6c5',
 'Bubble Bath_4020',
 'e7561dd2',
 'Cart Balancer (Assessment)_4090',
 '363d3849',
 'Dino Drink_4020',
 'Balancing Act',
 'Bottle Filler (Activity)_4030',
 'd45ed6a1',
 'Dino Drink_2070',
 'Leaf Leader_4070',
 'All Star Sorting_3121',
 'Bird Measurer (Assessment)_2010',
 '37ee8496',
 'Watering Hole (Activity)_2000',
 'Dino Drink_3021',
 '92687c59',
 'c58186bf',
 'ecc36b7f',
 'e04fb33d',
 'Bird Measurer (Assessment)_4035',
 '565a3990',
 'Air Show_4110',
 'f806dc10',
 'Cart Balancer (Assessment)_4100',
 '3edf6747',
 '46cd75b4',
 'Bottle Filler (Activity)_4020',
 'a1bbe385',
 'Leaf Leader_2020',
 'b5053438',
 '9c5ef70c',
 'Pan Balance_3120',
 'c1cac9a2',
 "Pirate's Tale",
 '5859dfb6',
 'Bubble Bath_3020',
 '8b757ab8',
 '832735e1',
 '461eace6',
 'bbfe0445',
 '0086365d',
 'd02b7a8e',
 '9e34ea74',
 '27253bdc',
 'Pan Balance_3020',
 'Flower Waterer (Activity)_4025',
 'Mushroom Sorter (Assessment)_3110',
 'Mushroom Sorter (Assessment)_4025',
 'Crystals Rule_2000',
 'Leaf Leader_3121',
 'Tree Top City - Level 3_2000',
 'Happy Camel_4040',
 '2fb91ec1',
 '93b353f2',
 'Crystals Rule_4070',
 '3babcb9b',
 'Cart Balancer (Assessment)_4030',
 'Leaf Leader_3120',
 'Dino Drink_4030',
 '63f13dd7',
 'Mushroom Sorter (Assessment)_2000',
 '3bfd1a65',
 'db02c830',
 'c7fe2a55',
 '1996c610',
 4031,
 'Cauldron Filler (Assessment)_4090',
 '884228c8',
 'Scrub-A-Dub_3110',
 '6c517a88',
 'Bottle Filler (Activity)_3010',
 'e9c52111',
 'All Star Sorting_2020',
 '2b9272f4',
 'Watering Hole (Activity)_4090',
 '90d848e0',
 'Sandcastle Builder (Activity)_4021',
 'Chow Time_2000',
 'Egg Dropper (Activity)_3010',
 'Leaf Leader_4090',
 '8ac7cce4',
 '9e4c8c7b',
 'Cauldron Filler (Assessment)_3010',
 'Happy Camel_2030',
 'Bubble Bath_4070',
 'Bird Measurer (Assessment)_4070',
 'bc8f2793',
 'Scrub-A-Dub_4020',
 'Bird Measurer (Assessment)_4100',
 'Happy Camel_3110',
 '5de79a6a',
 'Fireworks (Activity)_2000',
 'Bubble Bath_2020',
 'Happy Camel_4095',
 'Chest Sorter (Assessment)_3020',
 'Dino Drink_2060',
 'Chow Time_3020',
 '022b4259',
 'Cauldron Filler (Assessment)_2020',
 'Dino Dive_3020',
 '9d29771f',
 'd88ca108',
 'c2baf0bd',
 'Flower Waterer (Activity)_4070',
 'Air Show_4070',
 'e694a35b',
 '6c930e6e',
 'Watering Hole (Activity)_3110',
 'Bug Measurer (Activity)_2000',
 'a5e9da97',
 '763fc34e',
 'Dino Dive_4020',
 'b88f38da',
 'Happy Camel_3020',
 'Fireworks (Activity)_4090',
 'Air Show_2060',
 'Chest Sorter (Assessment)_4040',
 'Bubble Bath_2030',
 'Air Show_4020',
 '9b01374f',
 '99ea62f3',
 '4ef8cdd3',
 'Happy Camel_3120',
 '85d1b0de',
 'Chest Sorter (Assessment)_3121',
 'Chest Sorter (Assessment)_3021',
 'All Star Sorting_4020',
 'Slop Problem',
 '56cd3b43',
 'Sandcastle Builder (Activity)_4090',
 'Air Show_3010',
 'Scrub-A-Dub_4010',
 'c7128948',
 'Crystal Caves - Level 3_2000',
 'b1d5101d',
 'Magma Peak - Level 2',
 '6f8106d9',
 'Pan Balance_4020',
 'Chicken Balancer (Activity)_4030',
 'Ordering Spheres_2000',
 'Happy Camel_4035',
 '04df9b66',
 '15f99afc',
 'cdd22e43',
 '7da34a02',
 'bdf49a58',
 'Bug Measurer (Activity)_3010',
 '7f0836bf',
 'Magma Peak - Level 1',
 '28a4eb9a',
 'Happy Camel_4020',
 'Air Show_2030',
 '33505eae',
 '31973d56',
 '17113b36',
 '2230fab4',
 '3b2048ee',
 'Chow Time_3021',
 'Pan Balance_4070',
 'Happy Camel_4030',
 '5348fd84',
 'Mushroom Sorter (Assessment)_4090',
 'beb0a7b9',
 '16dffff1',
 'Scrub-A-Dub_2080',
 'Bubble Bath_2000',
 'Egg Dropper (Activity)_4090',
 'bd612267',
 'Rulers_2000',
 'd88e8f25',
 'fbaf3456',
 '1575e76c',
 'Chicken Balancer (Activity)_3010',
 'Sandcastle Builder (Activity)_2000',
 '47f43a44',
 'Chicken Balancer (Activity)_4020',
 'Lifting Heavy Things_2000',
 '71fe8f75',
 'Chicken Balancer (Activity)_2000',
 'Honey Cake_2000',
 'c74f40cd',
 '5154fc30',
 'Cauldron Filler (Assessment)_4100',
 'Crystals Rule_2030',
 'Mushroom Sorter (Assessment)_3010',
 '00c73085',
 'e37a2b78',
 'Air Show_2075',
 'a592d54e',
 '99abe2bb',
 '795e4a37',
 'Bottle Filler (Activity)_3110',
 'Bird Measurer (Assessment)_4025',
 'Dino Dive_4010',
 '8d7e386c',
 'All Star Sorting_4070',
 'Crystal Caves - Level 2_2000',
 'Happy Camel_4070',
 'Watering Hole (Activity)_5000',
 'a6d66e51',
 'b7530680',
 'Cart Balancer (Assessment)_4035',
 'df4fe8b6',
 'd185d3ea',
 'Dino Dive_3110',
 'Bubble Bath_3021',
 'sum_event_code_count',
 '19967db1',
 '15ba1109',
 'Watering Hole (Activity)_4021',
 '2a512369',
 'All Star Sorting_4010',
 'Dino Dive_2060',
 'Leaf Leader_2070',
 'All Star Sorting_2030',
 'Cart Balancer (Assessment)_4020',
 'b2dba42b',
 '84b0e0c8',
 'Tree Top City - Level 2',
 '2b058fe3',
 'Chow Time_4070',
 '7423acbc',
 'dcaede90',
 2040,
 'e3ff61fb',
 'Crystal Caves - Level 1',
 'd3f1e122',
 'cb1178ad']

In [10]:
print("Deleting " + str(len(del_cols)) + " features!")
reduce_train.drop(del_cols, inplace = True, axis = "columns")
reduce_test.drop(del_cols, inplace = True, axis = "columns")

Deleting 403 features!


# Training phase begins

**Cappa loss helper method defination**

In [11]:
#Loss Function Decleration
def qwk_loss(a1, a2):
    max_rat = 3
    a1 = np.asarray(a1, dtype=int)
    a2 = np.asarray(a2, dtype=int)
    hist1 = np.zeros((max_rat + 1, ))
    hist2 = np.zeros((max_rat + 1, ))
    o = 0
    for k in range(a1.shape[0]):
        i, j = a1[k], a2[k]
        hist1[i] += 1
        hist2[j] += 1
        o +=  (i - j) * (i - j)
    e = 0
    for i in range(max_rat + 1):
        for j in range(max_rat + 1):
            e += hist1[i] * hist2[j] * (i - j) * (i - j)
    e = e / a1.shape[0]
    return 1 - o / e

**We Gonna have float values as predictions from our regression model so this is an helper function to convert these float values into labels 0,1,2,3 according to certain thresholds!**

In [12]:
def regr_resl_to_label(true_labels, preds_labels):
    preds_labels[preds_labels <= 1.12232214] = 0
    preds_labels[np.where(np.logical_and(preds_labels > 1.12232214, preds_labels <= 1.73925866))] = 1
    preds_labels[np.where(np.logical_and(preds_labels > 1.73925866, preds_labels <= 2.22506454))] = 2
    preds_labels[preds_labels > 2.22506454] = 3
    return 'cappa', qwk_loss(true_labels, preds_labels), True

**Method to train model that can be tweaked to be used by multiple other datasets.**

In [13]:
scores = []
def train_model(X: pd.DataFrame,
                y,
            folds = None,
            params: dict = None,
            del_cols: list = None):

    """Basic parameters
        1. X: train_data
        2. y: ground truth labels
        3. params: lightGBM parameters
        4. del_cols: columns to be avoided while training like accuracy_group must not be a column! 
    """
    global scores
    eval_metric = regr_resl_to_label #custom metric as defined above
    columns = [col for col in X.columns.values if not col in del_cols] #features
    
    models = [] #save n_folds models
    n_target = 1 # number of targets
    oof = np.zeros((len(X), n_target)) # out of fold predictions

    for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y, X['installation_id'])):
        
        print('Fold {} started at {}'.format(fold_n + 1,time.ctime()))
        X_train, X_valid = X.loc[train_index,columns], X.loc[valid_index,columns]
        y_train, y_valid = y.loc[train_index], y.loc[valid_index]
        print(X_train.shape)
        
        #Eval set preparation
        eval_set = [(X_train, y_train)]
        eval_names = ['train']
        eval_set.append((X_valid, y_valid))
        eval_names.append('valid')
        categorical_columns = 'auto'
        
        model = lgb.LGBMRegressor(**params)
        model.fit(X=X_train, y=y_train,
                       eval_set=eval_set, eval_names=eval_names, eval_metric=eval_metric,
                       verbose=params['verbose'], early_stopping_rounds=params['early_stopping_rounds'],
                       categorical_feature=categorical_columns)
        
        oof[valid_index] = model.predict(X_valid).reshape(-1, n_target)
        score = regr_resl_to_label(X.loc[valid_index,"accuracy_group"],oof[valid_index])
        scores.append(score)
        models.append(model)
    scores = [score[1][0] for score in scores]
    print(scores)
    return models


**very simple prediction method, Just get the results from multiple models from different folds and average them all**

In [14]:
def predict(models, X_test, averaging: str = 'usual'):
    full_prediction = np.zeros((X_test.shape[0], 1))
    for i in range(len(models)):
        X_t = X_test.copy()
        if cols_to_drop is not None:
            del_cols = [col for col in cols_to_drop if col in X_t.columns.values]
            X_t = X_t.drop(del_cols, axis=1)
        y_pred = models[i].predict(X_t).reshape(-1, full_prediction.shape[1])
        if full_prediction.shape[0] != len(y_pred):
            full_prediction = np.zeros((y_pred.shape[0], 1))
        if averaging == 'usual':
            full_prediction += y_pred
        elif averaging == 'rank':
            full_prediction += pd.Series(y_pred).rank().values
    return full_prediction / len(models)

**My best parameters found by many parameter tuning techniques and less overfitting**

In [15]:
params = {'verbose': 100,
          'learning_rate': 0.010514633017309072,
          'metric': 'rmse',
          'bagging_freq': 3,
          'boosting_type': 'gbdt',
          'eval_metric': 'cappa',
          'lambda_l1': 4.8999704874480745,
          'colsample_bytree': 0.4236269531042225,
          'early_stopping_rounds': 100,
          'max_depth': 12,
          'lambda_l2': 0.054084652510602016,
          'bagging_fraction': 0.7931423220563563,
          'n_jobs': -1,
          'n_estimators': 2000,
          'objective': 'regression',
          'seed': 42}

**You may get Light gbm error regarding special json characters in names of some columns**

These two lines will get deal with those errors for you :)

reduce_train.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in reduce_train.columns]

reduce_test.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in reduce_test.columns]

In [16]:
reduce_train.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in reduce_train.columns]
reduce_test.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in reduce_test.columns]

# no need for these columns in training
cols_to_drop = ['game_session', 'installation_id', 'timestamp', 'accuracy_group', 'timestampDate'] + [col for col in reduce_train.columns.values if "_time" in str(col)]#ground truth fact labels
y = reduce_train['accuracy_group']
#group k-fold and please don't go for just k-fold
n_fold = 5
folds = GroupKFold(n_splits=n_fold)
models = train_model(X = reduce_train, y = y,folds = folds, params = params, del_cols = cols_to_drop)

Fold 1 started at Sun Dec 15 08:02:00 2019
(14152, 441)
Training until validation scores don't improve for 100 rounds
[100]	train's rmse: 1.07733	train's cappa: 0.46883	valid's rmse: 1.09718	valid's cappa: 0.443364
[200]	train's rmse: 0.995906	train's cappa: 0.614059	valid's rmse: 1.03227	valid's cappa: 0.568282
[300]	train's rmse: 0.951165	train's cappa: 0.654268	valid's rmse: 1.00347	valid's cappa: 0.598592
[400]	train's rmse: 0.922111	train's cappa: 0.673092	valid's rmse: 0.988236	valid's cappa: 0.614281
[500]	train's rmse: 0.90148	train's cappa: 0.68664	valid's rmse: 0.981241	valid's cappa: 0.618828
[600]	train's rmse: 0.884281	train's cappa: 0.698081	valid's rmse: 0.976476	valid's cappa: 0.619236
Early stopping, best iteration is:
[557]	train's rmse: 0.891519	train's cappa: 0.693657	valid's rmse: 0.978499	valid's cappa: 0.621845
Fold 2 started at Sun Dec 15 08:03:11 2019
(14152, 441)
Training until validation scores don't improve for 100 rounds
[100]	train's rmse: 1.07416	train's 

In [17]:
preds = predict(models, reduce_test)
    
coefficients = [1.12232214, 1.73925866, 2.22506454]
preds[preds <= coefficients[0]] = 0
preds[np.where(np.logical_and(preds > coefficients[0], preds <= coefficients[1]))] = 1
preds[np.where(np.logical_and(preds > coefficients[1], preds <= coefficients[2]))] = 2
preds[preds > coefficients[2]] = 3

In [18]:
sample_submission['accuracy_group'] = preds.astype(int)
sample_submission.to_csv('submission.csv', index=False)
sample_submission['accuracy_group'].value_counts(normalize=True)

2    0.331
3    0.324
1    0.182
0    0.163
Name: accuracy_group, dtype: float64