<h1><center><font size="6">Meta Kaggle: What happened to the team size?</font></center></h1>


# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Competitions</a>  
  - <a href='#21'>Check the data</a>   
  - <a href='#22'>Competition types</a>   
  - <a href='#23'>Number of competitions, grouped by year and Max Team Size</a>
  - <a href='#24'>Number of competitions, grouped by year, Max Team Size and Host Segment Title</a>
- <a href='#3'>Teams</a>   
  - <a href='#31'>Check the data</a>   
  - <a href='#32'>Teams per year and teams per year and team size</a>   
  - <a href='#33'>Number of teams per team size and year heatmap (all competitions)</a>   
  - <a href='#34'>Number of teams for team size and year heatmap (no InClass competitions)</a>   
  - <a href='#35'>Time variation of number of teams vs. team size (with plotly and bubbly)</a>   
  - <a href='#36'>Time variation of number of winning teams vs. team size (with plotly and bubbly)</a>   
  - <a href='#37'>Teams size and teams rankings</a>     
- <a href='#4'>References</a>   



# <a id="1">Introduction</a>  

The objective of this Kernel is to investigate how team sizes (both the limits set by competitions and the sizes of the teams actually formed) evolved over time.   

We will use the data from **Meta Kaggle**<a href='#4'>[1]</a>.

We will try to understand how many competitions limited the team size (**MaxTeamSize**) each year. The type of competition is also an important factor, so we will also show the results grouped by competition type (**HostSegmentTitle**).   

Then, we will look at the number of teams per year and at the number of teams grouped by year and team size.    

**Note**: the data was last updated on **Sept 1, 2018**.


## Load packages

In [None]:
# __future__ imports conventionally come first (this one only matters on Python 2)
from __future__ import division
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from bubbly.bubbly import bubbleplot
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
# Data folder: differs between a local run and the Kaggle kernel
IS_LOCAL = False
if IS_LOCAL:
    PATH = "../input/meta-kaggle"
else:
    PATH = "../input"
print(os.listdir(PATH))

## Read the data

In [None]:
competition_df = pd.read_csv(PATH+"/Competitions.csv")
teams_df = pd.read_csv(PATH+"/Teams.csv")
team_membership_df = pd.read_csv(PATH+"/TeamMemberships.csv")

## Check the data

In [None]:
print("Meta Kaggle competition data -  rows:",competition_df.shape[0]," columns:", competition_df.shape[1])
print("Meta Kaggle teams data -  rows:",teams_df.shape[0]," columns:", teams_df.shape[1])
print("Meta Kaggle team memberships data -  rows:",team_membership_df.shape[0]," columns:", team_membership_df.shape[1])

# <a id="2">Competitions</a>

## <a id="21">Check the data</a>

Let's inspect the competition data. We will also check the columns for missing data.

In [None]:
competition_df.describe()

We will extract the **Deadline Year** from the **Deadline Date**.

In [None]:
competition_df["DeadlineYear"] = pd.to_datetime(competition_df['DeadlineDate']).dt.year
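
As a small illustration of the year extraction above, the sketch below applies the same `pd.to_datetime(...).dt.year` call to a made-up frame. The sample dates are invented, and `errors="coerce"` is an optional addition (not used in this Kernel) that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Toy stand-in for Competitions.csv; the dates are invented for illustration.
toy = pd.DataFrame(
    {"DeadlineDate": ["2017-06-30 23:59:00", "2018-01-15 12:00:00", None]}
)

# Same call as in the Kernel; errors="coerce" maps bad or missing values to NaT.
toy["DeadlineYear"] = pd.to_datetime(toy["DeadlineDate"], errors="coerce").dt.year

print(toy)
```

Note that a missing deadline yields a missing year, so the year column becomes floating point whenever at least one date is absent.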

## <a id="22">Competition types</a>

Let's check how many competition types there are. There are two relevant fields, **CompetitionTypeId** and **HostSegmentTitle**. As indicated by [James Trotman](https://www.kaggle.com/jtrotman), the second is more meaningful for our purpose. Let's visualize the host segment title distribution.

In [None]:
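fig, ax = plt.subplots(nrows=1,figsize=(12,4))
# Pass the column as x= explicitly; recent seaborn releases treat the first positional argument as data
sns.countplot(x=competition_df['HostSegmentTitle'], ax=ax)
plt.show()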

Most of the competitions are of type **InClass** (almost 700), followed by **Featured**, then **Research**, **Playground** and **Recruitment**. Let's take a closer look at some of the features most relevant to our subject.

In [None]:
var = ["DeadlineDate", "DeadlineYear", "CompetitionTypeId", "HostSegmentTitle", "TeamMergerDeadlineDate", "TeamModelDeadlineDate", "MaxTeamSize", "BanTeamMergers"]
competition_df[var].head(5)

Let's also check missing data.

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(competition_df[var])
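
To make the output of the missing-data check concrete, here is a compact variant of the helper run on a tiny invented frame. It is only a sketch: it relies on the fact that the mean of a boolean mask is the fraction of missing values, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def missing_data(data):
    """Count and percentage of missing values per column, worst first."""
    total = data.isnull().sum()
    percent = data.isnull().mean() * 100  # mean of booleans = fraction missing
    out = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return out.sort_values("Total", ascending=False)

# Invented sample: three of the four competitions have no MaxTeamSize.
toy = pd.DataFrame({
    "MaxTeamSize": [np.nan, 3, np.nan, np.nan],
    "HostSegmentTitle": ["InClass", "Featured", "InClass", "Research"],
})
summary = missing_data(toy)
print(summary)
```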

We can see that **MaxTeamSize** is not set in 74.9% of the cases. This means that the team size is not restricted.

Let's replace the undefined **MaxTeamSize** values with **-1**.

In [None]:
competition_df.loc[competition_df['MaxTeamSize'].isnull(),'MaxTeamSize'] = -1
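
The sentinel replacement can be sketched on invented values as follows. `fillna` is used here as an equivalent of the `.loc` assignment above; the cast to `int` is an optional extra, not part of the Kernel.

```python
import numpy as np
import pandas as pd

# Invented values: NaN means "no limit on team size".
toy = pd.DataFrame({"MaxTeamSize": [np.nan, 1, 3, np.nan]})

# Replace missing limits with the sentinel -1 (equivalent to the .loc assignment).
toy["MaxTeamSize"] = toy["MaxTeamSize"].fillna(-1).astype(int)

# Competitions that do restrict team size are then selected with "> -1".
limited = toy[toy["MaxTeamSize"] > -1]
print(limited)
```

The sentinel keeps unrestricted competitions visible as their own category in grouped counts and bar plots, where `NaN` keys would otherwise be dropped by default.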

## <a id="23">Number of competitions, grouped by year and Max Team Size</a>


Let's show the number of competitions having a certain MaxTeamSize, grouped by year.

In [None]:
tmp = competition_df.groupby('DeadlineYear')['MaxTeamSize'].value_counts()
df = pd.DataFrame(data={'Competitions': tmp.values}, index=tmp.index).reset_index()
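
The tidy count table built above can also be obtained in one chain with `groupby(...).size()`. The sketch below shows this on invented rows; it is an alternative formulation, not the code the Kernel uses.

```python
import pandas as pd

# Invented sample of competitions (MaxTeamSize == -1 means "no limit").
toy = pd.DataFrame({
    "DeadlineYear": [2016, 2016, 2017, 2017, 2017],
    "MaxTeamSize": [-1, 3, 1, 1, -1],
})

# One row per (year, max team size) pair with the number of competitions.
counts = (
    toy.groupby(["DeadlineYear", "MaxTeamSize"])
       .size()
       .reset_index(name="Competitions")
)
print(counts)
```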

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3,figsize=(16,16))
s1 = sns.barplot(ax=ax1, x = 'DeadlineYear', y='Competitions',hue='MaxTeamSize',data=df[df['MaxTeamSize']>-1])
s1.set_title("Number of competitions with size of max team set per year")
s2 = sns.countplot(x=competition_df[competition_df['MaxTeamSize']>-1]['DeadlineYear'],ax=ax2)
s2.set_title("Total number of competitions with size of max team set per year")
# Target the third axes explicitly instead of relying on the current axes
s3 = sns.countplot(x=competition_df['DeadlineYear'], ax=ax3)
s3.set_title("Total number of competitions per year")
plt.show();

We can observe a few interesting things:    
* In 2017, the number of competitions more than doubled compared with the previous year, 2016;  
* Also in 2017, the number of competitions limiting the team size more than doubled compared with 2016; the number of competitions limiting teams to a single member was also very large (70% of all competitions);   
* In 2018, the number of competitions (up to **Sept 1**, when the data was last updated) was already larger than for the whole of 2017; at the same time, the number of competitions with a team-size limit dropped below the 2016 level.  
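
Statements like the ones in this list can be checked numerically rather than read off the bar charts. The sketch below computes, per year, the total number of competitions, the number with a limit, the share with a limit and the growth ratio against the previous year. The sample rows and the helper column `HasLimit` are invented for illustration.

```python
import pandas as pd

# Invented sample of competitions (MaxTeamSize == -1 means "no limit").
toy = pd.DataFrame({
    "DeadlineYear": [2016, 2016, 2017, 2017, 2017, 2017],
    "MaxTeamSize": [-1, 3, 1, 1, -1, -1],
})

# Hypothetical helper column: does the competition restrict team size?
toy["HasLimit"] = toy["MaxTeamSize"] > -1

per_year = toy.groupby("DeadlineYear")["HasLimit"].agg(["size", "sum"])
per_year.columns = ["total", "limited"]
per_year["limited_share"] = per_year["limited"] / per_year["total"]
# Ratio to the previous year: a value above 2 means "more than doubled".
per_year["growth"] = per_year["total"] / per_year["total"].shift(1)
print(per_year)
```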

Let's also look at the distribution of competitions grouped by year, max team size and host segment title.  
**MaxTeamSize** = <font color="red">**-1**</font> means that no max team size is set.
 
 
 ## <a id="24">Number of competitions, grouped by year, Max Team Size and Host Segment Title</a>



In [None]:
tmp = competition_df.groupby(['DeadlineYear','MaxTeamSize'])['HostSegmentTitle'].value_counts()
df = pd.DataFrame(data={'Competitions': tmp.values}, index=tmp.index).reset_index()

In [None]:
f, ax = plt.subplots(5,2, figsize=(16,28))
for i, Year in enumerate(df['DeadlineYear'].unique()):
    df1 = df[df['DeadlineYear']==Year]
    s1 = sns.barplot(ax=ax[i//2, i%2], x = 'MaxTeamSize', y='Competitions',hue='HostSegmentTitle',data=df1)
    s1.set_title("Year {}".format(Year))
plt.show();

We can see that, since 2016, most of the competitions with **MaxTeamSize** not set (-1) are **InClass** competitions. In 2017 and 2018, almost all competitions are **InClass**.    

Let's leave out the **InClass** competitions and plot only the remaining ones.

In [None]:
f, ax = plt.subplots(5,2, figsize=(16,28))
df0 = df[df['HostSegmentTitle']!='InClass']
for i, Year in enumerate(df['DeadlineYear'].unique()):
    df1 = df0[df0['DeadlineYear']==Year]
    s1 = sns.barplot(ax=ax[i//2, i%2], x = 'MaxTeamSize', y='Competitions',hue='HostSegmentTitle',data=df1)
    s1.set_title("Year {}".format(Year))
plt.show();

If we look at the competitions that are not **InClass**, we can see that the majority are either **Featured** or **Research**, and most have no **MaxTeamSize** set.    

Among the competitions with a deadline in 2018, only one **Featured** competition has a **MaxTeamSize** set (to 3). None of the other **Featured** competitions with a deadline in 2018 has a **MaxTeamSize** set.

# <a id="3">Teams</a>

## <a id="31">Check the data</a>

Let's now inspect the team and team membership datasets.

In [None]:
teams_df.head(5)

In [None]:
missing_data(teams_df)

We can see that over **1M** teams are registered. Let's now look at the team memberships. We can merge the team data with the competition data (there are no missing **CompetitionId** values, and this is the merge field).

In [None]:
team_membership_df.head(5)

In [None]:
missing_data(team_membership_df)

We can merge the team membership data with the team data (there are no missing **TeamId** values, and this is the merge field).

## <a id="32">Teams per year and teams per year and team size</a>

Let's now check the number of teams per year. We will merge the Competitions, Teams and Team Memberships data.

In [None]:
comp_team_df = competition_df.merge(teams_df, left_on='Id', right_on='CompetitionId', how='inner')
comp_team_membership_df = comp_team_df.merge(team_membership_df, left_on='Id_y', right_on='TeamId', how='inner')
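
Because both Competitions and Teams have an `Id` column, the first merge renames them to `Id_x` (competition) and `Id_y` (team), which is why the second merge joins on `Id_y`. The sketch below reproduces this on three tiny invented frames. The `validate="one_to_many"` argument is an optional safeguard added here, not part of the Kernel's code.

```python
import pandas as pd

# Invented miniature versions of the three Meta Kaggle tables.
competitions = pd.DataFrame({"Id": [10, 11], "DeadlineYear": [2017, 2018]})
teams = pd.DataFrame({"Id": [100, 101, 102], "CompetitionId": [10, 10, 11]})
memberships = pd.DataFrame({"Id": [1, 2, 3, 4], "TeamId": [100, 100, 101, 102]})

# Competition -> teams: the clashing "Id" columns become Id_x and Id_y.
comp_team = competitions.merge(
    teams, left_on="Id", right_on="CompetitionId",
    how="inner", validate="one_to_many",
)

# Team -> memberships: Id_y is the team Id; the membership Id keeps the name "Id".
merged = comp_team.merge(
    memberships, left_on="Id_y", right_on="TeamId",
    how="inner", validate="one_to_many",
)
print(merged)
```

After both merges there is one row per team member, so counting rows per `TeamId` gives the team size.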

Let's plot the number of teams per year and also the number of teams per year and per number of team members.  
We prepare the dataframe with the number of teams per year and team size.

In [None]:
tmp = comp_team_membership_df.groupby(['DeadlineYear','TeamId'])['Id'].count()
df = pd.DataFrame(data={'Teams': tmp.values}, index=tmp.index).reset_index()
tmp = df.groupby(['DeadlineYear','Teams']).count()
df2 = pd.DataFrame(data=tmp.values, index=tmp.index).reset_index()
df2.columns = ['Year', 'Team size','Teams']
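
The two-stage counting above (first members per team, then teams per team size) may be easier to follow on a small example. The sketch below uses invented membership rows and `groupby(...).size()`, an alternative formulation of the same computation.

```python
import pandas as pd

# Invented merged data: one row per team member.
merged = pd.DataFrame({
    "DeadlineYear": [2017, 2017, 2017, 2018],
    "TeamId": [100, 100, 101, 102],
    "Id": [1, 2, 3, 4],  # membership Id
})

# Stage 1: team size = number of membership rows per team.
team_size = (
    merged.groupby(["DeadlineYear", "TeamId"])
          .size()
          .reset_index(name="Team size")
)

# Stage 2: number of teams for each (year, team size) pair.
teams_per_size = (
    team_size.groupby(["DeadlineYear", "Team size"])
             .size()
             .reset_index(name="Teams")
             .rename(columns={"DeadlineYear": "Year"})
)
print(teams_per_size)
```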

In [None]:
def plot_heatmap_count(data_df,feature1, feature2, color, title):
    # Keyword arguments are required by pandas 2.x (positional use was removed)
    matrix = data_df.pivot(index=feature1, columns=feature2, values='Teams')
    fig, (ax1) = plt.subplots(ncols=1, figsize=(16,6))
    sns.heatmap(matrix, 
        xticklabels=matrix.columns,
        yticklabels=matrix.index,ax=ax1,linewidths=.1,linecolor='darkblue',annot=True,cmap=color)
    plt.title(title, fontsize=14)
    plt.show()
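
The heatmap helper relies on `pivot` to turn the long table into a matrix with one row per team size and one column per year. A minimal sketch of that step alone, on invented counts, looks like this:

```python
import pandas as pd

# Invented long-format counts, shaped like df2 in the Kernel.
toy = pd.DataFrame({
    "Year": [2017, 2017, 2018],
    "Team size": [1, 2, 1],
    "Teams": [5, 2, 7],
})

# Rows = team size, columns = year, cell = number of teams.
matrix = toy.pivot(index="Team size", columns="Year", values="Teams")
print(matrix)
```

Combinations that never occur (here, teams of size 2 in 2018) appear as `NaN`, which the heatmap leaves blank.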

Let's now show the number of teams grouped by year.

In [None]:
fig, ax = plt.subplots(nrows=1,figsize=(16,6))
s1 = sns.countplot(x=comp_team_df['DeadlineYear'], ax=ax)
s1.set_title("Total number of teams per year")
plt.show();

Let's also show the number of teams grouped by year and by competition type.

In [None]:
tmp = comp_team_df.groupby('DeadlineYear')['HostSegmentTitle'].value_counts()
# Each row of comp_team_df is a team, so these counts are numbers of teams
df = pd.DataFrame(data={'Teams': tmp.values}, index=tmp.index).reset_index()

In [None]:
fig, ax = plt.subplots(nrows=1,figsize=(16,6))
s1 = sns.barplot(ax=ax, x='DeadlineYear', y='Teams', hue='HostSegmentTitle', data = df)
s1.set_title("Total number of teams per year, grouped by HostSegmentTitle (Competition type)")
plt.show();

We can see that most of the teams are formed for **Featured** competitions.

Let's now show the number of teams per year and per number of team members.

## <a id="33">Number of teams per team size and year heatmap (all competitions)</a>

In [None]:
plot_heatmap_count(df2,'Team size','Year', 'Reds', "Number of teams grouped by year and by team size")

We can see that large teams were not restricted to 2018. The largest teams were actually in:

* **2012** (**40** and **23** team members);  
* **2017** (**34** team members);  
* **2014** (**24** and **25** team members);  
* **2013** (**24** team members).  

What changes in 2017 and 2018 is a sudden increase in the number of teams (2017) and in the number of medium-sized teams (4-8 team members).

When checking the number of competitions per year, we also notice that in 2018 the share of competitions without a team-size limit increased. This partly explains the pattern we observed, namely more and more large teams in 2018. Of course, these findings will have to be revisited after Meta Kaggle is updated with all 2018 data.  

Let's remove the **InClass** competitions and plot again the number of teams grouped by year and team size.

## <a id="34">Number of teams for team size and year heatmap (no InClass competitions)</a>


In [None]:
no_inclass_df = comp_team_membership_df[comp_team_membership_df['HostSegmentTitle']!='InClass']
tmp = no_inclass_df.groupby(['DeadlineYear','TeamId'])['Id'].count()
df = pd.DataFrame(data={'Teams': tmp.values}, index=tmp.index).reset_index()
tmp = df.groupby(['DeadlineYear','Teams']).count()
df2 = pd.DataFrame(data=tmp.values, index=tmp.index).reset_index()
df2.columns = ['Year', 'Team size','Teams']
plot_heatmap_count(df2,'Team size','Year', 'Blues', "Number of teams grouped by year and by team size (no InClass comp.)")

As expected, removing the **InClass** competitions gives a result very similar to the one for all competitions, since the majority of teams are formed for **Featured** competitions.

## <a id="35">Time variation of number of teams vs. team size (with plotly and bubbly)</a>

Let's now represent on a single graph, using **bubbly** <a href='#4'>[2]</a><a href='#4'>[3]</a> (bubble plots built on **plotly** <a href='#4'>[4]</a>), the yearly variation of **Teams** (number of teams) vs. **Team size**. A separate bubble is displayed for each combination of **Host Segment Title** (i.e. type of competition) and **Team size**. The bubble size grows with the square root of the number of teams, and the bubble color encodes the team size. The number-of-teams axis is logarithmic. The plot is animated, with one frame for each **year**.

In [None]:
tmp = comp_team_membership_df.groupby(['DeadlineYear','TeamId', 'HostSegmentTitle'])['Id'].count()
df3 = pd.DataFrame(data={'Teams': tmp.values}, index=tmp.index).reset_index()
tmp = df3.groupby(['DeadlineYear','HostSegmentTitle','Teams']).count()
df4 = pd.DataFrame(data=tmp.values, index=tmp.index).reset_index()
df4.columns = ['Year', 'Host Segment Title','Team size','Teams']
df4['TeamsSqrt'] = np.sqrt(df4['Teams'] + 2)

In [None]:
figure = bubbleplot(dataset=df4, x_column='Team size', y_column='Teams', color_column = 'Team size',
    bubble_column = 'Host Segment Title', time_column='Year', size_column = 'TeamsSqrt',
    x_title='Team size', y_title='Number of Teams [log scale]', 
    title='Number of Teams vs. Team size - time variation (years)', 
    colorscale='Rainbow', colorbar_title='Team size', 
    x_range=[-5,41], y_range=[-0.4,7], y_logscale=True, scale_bubble=5, height=650)
iplot(figure, config={'scrollZoom': True})

## <a id="36">Time variation of number of winning teams vs. team size (with plotly and bubbly)</a>

Let's now focus on the winning teams (teams with bronze, silver or gold medals). We will select only **Featured** competitions.

In [None]:
feature_df = comp_team_membership_df[comp_team_membership_df['HostSegmentTitle']=='Featured']

We represent the winning teams grouped by medal (Gold, Silver and Bronze) and team size.   
In the graph, the team size is on the x-axis and the number of teams is on the y-axis.

In [None]:
tmp = feature_df.groupby(['DeadlineYear','TeamId', 'Medal'])['Id'].count()
df3 = pd.DataFrame(data={'Teams': tmp.values}, index=tmp.index).reset_index()
tmp = df3.groupby(['DeadlineYear','Medal','Teams']).count()
df4 = pd.DataFrame(data=tmp.values, index=tmp.index).reset_index()
df4.columns = ['Year', 'Medal','Team size','Teams']
df4['Rank'] = (df4['Medal'] - 1) / 2
df4['Size'] = 4 - df4['Medal']

In [None]:
bins = [-0.01, 0.49, 0.99, np.inf]
names = ['Gold', 'Silver', 'Bronze']
df4['MedalName'] = pd.cut(df4['Rank'], bins, labels=names)
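
The medal codes are mapped to names by rescaling them to a 0-1 `Rank` and binning with `pd.cut`. The sketch below walks through that on the three possible codes, assuming (as the code above implies) that 1, 2 and 3 stand for gold, silver and bronze. The direct `map` shown next to it is an alternative added here for comparison, not part of the Kernel.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Medal": [1, 2, 3]})

# Same transformation as in the Kernel: 1, 2, 3 -> 0.0, 0.5, 1.0.
toy["Rank"] = (toy["Medal"] - 1) / 2

# Bins are right-closed: (-0.01, 0.49], (0.49, 0.99], (0.99, inf].
bins = [-0.01, 0.49, 0.99, np.inf]
names = ["Gold", "Silver", "Bronze"]
toy["MedalName"] = pd.cut(toy["Rank"], bins, labels=names)

# Alternative: map the codes to names directly.
toy["MedalNameDirect"] = toy["Medal"].map({1: "Gold", 2: "Silver", 3: "Bronze"})
print(toy)
```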

In [None]:
figure = bubbleplot(dataset=df4, x_column='Team size', y_column='Teams', color_column = 'Rank',
    bubble_column = 'MedalName', time_column='Year', size_column = 'Size', 
    x_title='Team size', y_title='Number of Teams [log scale]', 
    colorscale = [[0, "gold"], [0.5, "silver"], [1,"brown"]],
    title='Number of Winning Teams vs. Team size - time variation (years)', 
    x_range=[-5,41], y_range=[-0.4,4], y_logscale=True, scale_bubble=0.2, height=650)
iplot(figure, config={'scrollZoom': True})

For each year, the largest medal-winning teams were:
* 2010: 1 team winning **gold**, with **4** members;  
* 2011: 1 team winning **gold**, with **12** members;  
* 2012: 1 team winning **bronze**, with **40** members;  
* 2013: 2 teams winning **gold**, with **24** members;  
* 2014: 1 team winning **bronze**, with **6** members;   
* 2015: 1 team winning **bronze**, with **18** members;  
* 2016: 1 team winning **gold**, with **13** members;  
* 2017: 1 team winning **bronze**, with **34** members;  
* 2018: 1 team winning **silver**, with **23** members.  




In [None]:
df5 = df4[df4['Medal']==1.0]
plot_heatmap_count(df5,'Team size','Year', 'Greens', "Number of Gold winning teams grouped by year and by team size")

We can observe several things:
* In **2018**, the number of **gold**-winning teams increased only for teams with **2**, **5** and **7** members;  
* The largest teams winning gold were in **2013** (**24**, **10** members), **2012** (**23**, **15**, **12** members), **2011** (**12** members), **2016** (**13** and **11** members) and **2017** (**10** members).  


## <a id="37">Teams size and teams rankings</a>  


We select the teams from **Featured** competitions and check whether team size is correlated in any way with team rankings.   


For this, we count the number of members of each team and then merge the result back with the **teams_df** dataset, so that a single dataset holds the number of team members per team and the public and private leaderboard ranks.


In [None]:
tmp = feature_df.groupby(['TeamId'])['Id'].count()
df = pd.DataFrame(data={'Team Size': tmp.values}, index=tmp.index).reset_index()
#merge back df with teams_df
df2 = df.merge(teams_df, left_on='TeamId', right_on='Id', how='inner')
var = ['Team Size', 'PublicLeaderboardRank', 'PrivateLeaderboardRank' ]
teams_ranks_df = df2[var]

In [None]:
corr = teams_ranks_df.corr()
sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,
            cmap="YlGnBu",linewidths=.1,annot=True,vmin=-1, vmax=1)
plt.show()

We can observe that, while there is an obvious strong correlation between the public and private leaderboard ranks, there is essentially no correlation between the team size and either rank (the coefficients are negative, with absolute values under 0.1). Such small negative coefficients do not allow us to conclude that an inverse correlation exists. Let's check whether this changes significantly over the years.
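
Since leaderboard ranks are ordinal, a rank-based coefficient is one possible cross-check of the Pearson values above. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranked columns. This is a swapped-in technique, not what this Kernel uses, and the values are constructed so that larger teams always rank better, purely for illustration.

```python
import pandas as pd

# Constructed data (not from Meta Kaggle): larger teams get better (lower) ranks.
toy = pd.DataFrame({
    "Team Size": [1, 2, 3, 4, 5],
    "PublicLeaderboardRank": [50, 40, 30, 20, 10],
    "PrivateLeaderboardRank": [48, 41, 33, 18, 12],
})

pearson = toy.corr()            # default method, as used in the Kernel
spearman = toy.rank().corr()    # Spearman = Pearson correlation of the ranks

print(spearman.round(2))
```

On real data the two coefficients can differ noticeably when the relationship is monotonic but not linear, or when a few very large teams dominate.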

In [None]:
df2["Year"] = pd.to_datetime(df2['LastSubmissionDate']).dt.year
var = ['Team Size', 'PublicLeaderboardRank', 'PrivateLeaderboardRank']
years = df2['Year'].unique()
years = np.sort(years[~np.isnan(years)])

In [None]:
f, ax = plt.subplots(3,3, figsize=(16,16))
for i, year in enumerate(years):
    teams_ranks_df = df2[df2['Year']==year]
    corr = teams_ranks_df[var].corr()
    labels = ['Size', 'Public', 'Private']
    axi = ax[i//3, i%3]
    s1 = sns.heatmap(corr,xticklabels=labels,yticklabels=labels,
                     cmap="YlGnBu",linewidths=.1,annot=True,vmin=-1, vmax=1,ax=axi)
    s1.set_title("Year: {}".format(year))
plt.show()

Although the values are very small (in the range of **no inverse correlation**), we can observe the following for the correlation of the Team Size with the Private and Public Leaderboard Ranks:  

* The correlation is inverse and very small (i.e. team size increases as the rank value decreases, so the closer to the top, the larger the teams);  
* The magnitude of the values increased over the years, from -0.02 in 2010 to -0.12 and -0.1 in 2017 and 2018, respectively;  
* When the two differ, the inverse correlation is generally stronger for the Public rank, i.e. teams tend to be larger for higher positions on the public leaderboard.  
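
To follow the yearly trend as numbers rather than as a grid of heatmaps, the per-year coefficient can be collected into a single series. The sketch below does this for team size vs. private rank on constructed data (two years with opposite, perfectly linear relationships), so the values illustrate the mechanics only.

```python
import pandas as pd

# Constructed data: in 2017 larger teams rank better, in 2018 they rank worse.
toy = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2018, 2018, 2018],
    "Team Size": [1, 2, 3, 1, 2, 3],
    "PrivateLeaderboardRank": [30, 20, 10, 10, 20, 30],
})

# One Pearson coefficient per year.
corr_by_year = pd.Series({
    year: group["Team Size"].corr(group["PrivateLeaderboardRank"])
    for year, group in toy.groupby("Year")
})
print(corr_by_year)
```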


# <a id="4">References</a>  

[1] Meta Kaggle, https://www.kaggle.com/kaggle/meta-kaggle  
[2] <a href="https://www.kaggle.com/aashita">Aashita Kesarwani</a>, https://www.kaggle.com/aashita/guide-to-animated-bubble-charts-using-plotly  
[3] <a href="https://www.kaggle.com/aashita">Aashita Kesarwani</a>,  https://github.com/AashitaK/bubbly/blob/master/bubbly/bubbly.py  
[4] Plotly, https://community.plot.ly/   