# Meta Kaggle: Kernel Only Competitions

A quick query to list all kernel-only competitions to date. Daily submission counts are now plotted too.

## Contents

 * [Table of Kernel-Only Competitions ](#Table-of-Kernel-Only-Competitions-)
 * [Plot Rate of Kernel-Only Competitions Over Time](#Plot-Rate-of-Kernel-Only-Competitions-Over-Time)
 * [Plot Submission Counts](#Plot-Submission-Counts)
 * [Count Teams With Worst Score](#Count-Teams-With-Worst-Score)
 * [Conclusions](#Conclusions)


In [1]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import HTML, Image, display

In [2]:
from jt_mk_utils import *

In [3]:
plt.rc('figure', figsize=(10, 6))   
#plt.rc('font', size=14)
plt.style.use('bmh')

In [4]:
DATE_COLS = [
    'EnabledDate', 'DeadlineDate', 'ProhibitNewEntrantsDeadlineDate',
    'TeamMergerDeadlineDate', 'TeamModelDeadlineDate',
    'ModelSubmissionDeadlineDate'
]
comps = read_competitions()
comps['Days'] = (comps.DeadlineDate - comps.EnabledDate).dt.days
comps.shape

In [5]:
comps.columns

**OnlyAllowKernelSubmissions** is the one we need; choose some other columns to show if you like :)

In [6]:
ko = comps.query('OnlyAllowKernelSubmissions').sort_values('DeadlineDate', ascending=False)
ko.shape

In [7]:
ko.DeadlineDate.dt.year.value_counts().sort_index()

# Table of Kernel-Only Competitions 

In [8]:
show = ['Slug', 'DeadlineDate', 'Days', 'TotalTeams', 'RewardType', 'RewardQuantity']

def make_clickable(val):
    return f'<a href="https://www.kaggle.com/c/{val}">{val}</a>'

ko.set_index('Title')[show].style.format({'RewardQuantity': lambda x: f'${x:,.0f}', 'Slug': make_clickable})

# Plot Rate of Kernel-Only Competitions Over Time

Active competitions are not in the dataset (yet), so the drop at the end of the plot is not real.

In [9]:
dates = comps.DeadlineDate.dt.strftime('%Y-%m')
comps.query('HostSegmentTitle!="InClass"').groupby(dates).OnlyAllowKernelSubmissions.mean().plot();
plt.ylabel('Fraction of Competitions')
plt.title('Mean Rate of Competitions that are Kernel-Only');
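
The monthly rate plotted here is the mean of a boolean column within each year-month group. As a minimal sketch of that idea on made-up data (the `toy_comps` frame and its rows are hypothetical, not taken from Meta Kaggle):

```python
import pandas as pd

# Hypothetical competitions: two deadlines in January, two in February.
toy_comps = pd.DataFrame({
    "DeadlineDate": pd.to_datetime(
        ["2019-01-10", "2019-01-20", "2019-02-05", "2019-02-25"]
    ),
    "OnlyAllowKernelSubmissions": [True, False, True, True],
})

# Group by a year-month string; the mean of a boolean column is the
# fraction of rows that are True in each group.
month = toy_comps.DeadlineDate.dt.strftime("%Y-%m")
rate = toy_comps.groupby(month).OnlyAllowKernelSubmissions.mean()
print(rate)
```

A month with few finished competitions gives a noisy fraction, which is one reason the most recent months of the real plot should be read with care.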

In [10]:
teams = read_teams(filter=('CompetitionId', ko.Id)).set_index('Id')
teams.shape

In [11]:
teams.count()

In [12]:
subs = read_csv_filtered(MKDIR / 'Submissions.csv', 'TeamId', teams.index).set_index('Id')
parse_date(subs, 'SubmissionDate')
subs.shape
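
`read_csv_filtered` and `parse_date` come from `jt_mk_utils`, which is not shown in this notebook. The sketch below is only a guess at how such a filtered read could work, assuming a chunked `pandas.read_csv` that keeps matching rows. The name `read_csv_filtered_sketch`, the `chunksize` default and the in-memory `demo_csv` rows are all hypothetical.

```python
import io
import pandas as pd

def read_csv_filtered_sketch(source, column, values, chunksize=100_000, **kwargs):
    """Read a CSV in chunks, keeping only rows whose `column` is in `values`.

    Hypothetical stand-in for the notebook's helper; the real one may differ.
    """
    wanted = set(values)
    kept = [
        chunk[chunk[column].isin(wanted)]
        for chunk in pd.read_csv(source, chunksize=chunksize, **kwargs)
    ]
    return pd.concat(kept, ignore_index=True)

# Tiny made-up stand-in for Submissions.csv.
demo_csv = io.StringIO(
    "Id,TeamId,SubmissionDate\n"
    "1,10,2020-01-02\n"
    "2,11,2020-01-02\n"
    "3,10,2020-01-03\n"
    "4,12,2020-01-04\n"
)
demo_subs = read_csv_filtered_sketch(demo_csv, "TeamId", [10, 12], chunksize=2)

# Date parsing done inline here; the notebook delegates this to parse_date.
demo_subs["SubmissionDate"] = pd.to_datetime(demo_subs["SubmissionDate"])
print(demo_subs)
```

Filtering chunk by chunk keeps memory use bounded by the chunk size plus the kept rows, which may matter when the source table is much larger than the subset of interest.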

# Plot Submission Counts

Illustrate the growth of participation in these competitions over time. *Team Count* may look high because it includes every team that accepted the rules, even those that never submitted.

Note: Competitions that required re-runs of Kernels seem to have odd submission data...

e.g.

- PetFinder.my Adoption Prediction
- Freesound Audio Tagging 2019
- Quora Insincere Questions Classification
- Jigsaw Unintended Bias in Toxicity Classification
- iMet Collection 2019 - FGVC6
- NFL Big Data Bowl

It seems the original (public LB) submissions are not in the data and only the re-runs are here?

In [13]:
for cid, df in teams.groupby('CompetitionId'):
    comp = comps.query('Id==@cid').iloc[0]
    sdf = subs[subs.TeamId.isin(df.index) & (subs.SubmissionDate <= comp.DeadlineDate)]
    cdf = sdf.groupby('SubmissionDate').agg({'TeamId':['size', 'nunique']})
    cdf.columns = ['Submissions', 'Unique Teams']
    display(HTML(
        f'<h1 id="{comp.Slug}">{comp.Title}</h1>'
        f'<ul>'
        f'<li>Deadline Date: {comp.DeadlineDate}'
        f'<li>Teams Ranked: {df.PublicLeaderboardRank.count()}'
        f'<li>Team Count: {df.shape[0]}'
        f'<li>Submission Count: {sdf.shape[0]}'
        f'</ul>'
        )
    )
    try:
        cdf.plot()
        plt.title(f'{comp.Title} — Daily Submissions')
        plt.ylabel('Count')
        plt.grid(True, axis='both')
        plt.show()
    except Exception:  # skip competitions whose data cannot be plotted
        plt.close()
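
To make the three counts in each summary concrete, here is a small sketch on made-up data. The `toy_teams` and `toy_subs` frames are hypothetical; only the column names follow the cells above.

```python
import pandas as pd

# Hypothetical teams: team 3 accepted the rules but never submitted,
# so it has no public leaderboard rank.
toy_teams = pd.DataFrame(
    {"PublicLeaderboardRank": [1.0, 2.0, None]},
    index=pd.Index([1, 2, 3], name="Id"),
)
toy_subs = pd.DataFrame({
    "TeamId": [1, 1, 2, 1],
    "SubmissionDate": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"]
    ),
})

team_count = toy_teams.shape[0]                          # every team row
teams_ranked = toy_teams.PublicLeaderboardRank.count()   # non-null ranks only

# Per day: total submissions and the number of distinct teams submitting.
daily = toy_subs.groupby("SubmissionDate").agg({"TeamId": ["size", "nunique"]})
daily.columns = ["Submissions", "Unique Teams"]
print(team_count, teams_ranked)
print(daily)
```

Here the team count is 3 while only 2 teams are ranked, which is the gap the note on *Team Count* describes.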

# Count Teams With Worst Score

This is prompted by 

 - [HuBMAP - Hacking the Kidney](https://www.kaggle.com/c/hubmap-kidney-segmentation)
 - [Human Protein Atlas - Single Cell Classification](https://www.kaggle.com/c/hpa-single-cell-image-classification)

where many teams scored 0 on the private LB because of highly popular publicly shared code that silently failed.

Note that the first code competition:

 - [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge)
 
was a two-stage competition but very similar: the stage 2 test set was several times larger than the stage 1 set, and nearly 1/3 of teams failed to run, scoring 99.

In [14]:
teams_on_lb = teams.dropna(subset=['PrivateLeaderboardRank']).copy()
worst_sub_ids = teams_on_lb.sort_values('PrivateLeaderboardRank').groupby('CompetitionId', sort=False).PrivateLeaderboardSubmissionId.last()
worst_scores = worst_sub_ids.map(subs.PrivateScoreLeaderboardDisplay).dropna()
teams_on_lb['PrivateScore'] = teams_on_lb.PrivateLeaderboardSubmissionId.map(subs.PrivateScoreLeaderboardDisplay)
teams_on_lb['WorstScore'] = teams_on_lb.CompetitionId.map(worst_scores)
teams_on_lb['IsWorst'] = teams_on_lb.eval('PrivateScore==WorstScore')
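
The same idea can be checked on a toy frame: take the score of the last-ranked team in each competition, then count how many teams share it. This is a simplified sketch with made-up numbers. It stores the score directly on the team row, whereas the cell above looks it up through the submission id.

```python
import pandas as pd

# Hypothetical leaderboard: in competition 1 two teams share the worst
# score (a silent failure scoring 0); in competition 2 only one team does.
toy = pd.DataFrame({
    "CompetitionId": [1, 1, 1, 1, 2, 2, 2, 2],
    "PrivateLeaderboardRank": [1, 2, 3, 4, 1, 2, 3, 4],
    "PrivateScore": [0.9, 0.5, 0.0, 0.0, 0.8, 0.7, 0.6, 0.5],
})

# Score of the last-ranked team in each competition.
worst = (
    toy.sort_values("PrivateLeaderboardRank")
       .groupby("CompetitionId", sort=False)
       .PrivateScore.last()
)

# Flag teams whose score equals their competition's worst score.
toy["IsWorst"] = toy.PrivateScore == toy.CompetitionId.map(worst)
share = toy.groupby("CompetitionId").IsWorst.agg(["count", "sum", "mean"])
print(share)
```

Every competition has at least one team at the worst score, so the fraction never drops below 1 / count. Only values well above that floor suggest a shared failure.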

In [15]:
def fmt_float(v):
    return f'{v:,.0f}'

stats = ['count', 'sum', 'mean']
gb = teams_on_lb.groupby('CompetitionId')
df = gb['IsWorst'].agg(stats)
df['Score'] = worst_scores
df = df.join(comps.set_index('Id')[['Title', 'DeadlineDate']])
df = df.set_index('Title')
df = df.sort_values('mean', ascending=False)
df.style.bar(width=85, subset=stats, color='#47e87f').format({'sum': fmt_float, 'count': fmt_float})

In [16]:
plt.scatter(df.DeadlineDate.dt.date, df['mean'])
plt.xlabel('DeadlineDate')
plt.ylabel('Fraction of Teams')
plt.title('Mean Rate of Teams with Worst Score');

# Conclusions

I don't think kernel-only competitions are going to go away!