<h1><center><font size="6">Meta Kaggle: What happened to the team size?</font></center></h1>


# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Competitions</a>  
- <a href='#3'>Teams</a>  


# <a id="1">Introduction</a>  

The objective of this Kernel is to investigate how team sizes, both the limits set by competitions and the sizes of the teams actually formed, evolved over time.   

We will try to understand how many competitions limited the team size each year.    

Then, we will look at the number of teams per year and at the number of teams grouped by year and team size.    

**Note**: the data was last updated on **Aug 24, 2018**.


## Load packages

In [None]:
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
IS_LOCAL = False
import os
if(IS_LOCAL):
    PATH="../input/meta-kaggle"
else:
    PATH="../input"
print(os.listdir(PATH))

## Read the data

In [None]:
competition_df = pd.read_csv(PATH+"/Competitions.csv")
teams_df = pd.read_csv(PATH+"/Teams.csv")
team_membership_df = pd.read_csv(PATH+"/TeamMemberships.csv")

## Check the data

In [None]:
print("Meta Kaggle competition data -  rows:",competition_df.shape[0]," columns:", competition_df.shape[1])
print("Meta Kaggle teams data -  rows:",teams_df.shape[0]," columns:", teams_df.shape[1])
print("Meta Kaggle team memberships data -  rows:",team_membership_df.shape[0]," columns:", team_membership_df.shape[1])

## <a id="2">Competitions</a>

Let's inspect the competition data. We will also check the columns for missing data.

In [None]:
competition_df.describe()

We will extract the **Deadline Year** from the **Deadline Date**.

In [None]:
competition_df["DeadlineYear"] = pd.to_datetime(competition_df['DeadlineDate']).dt.year

In [None]:
var = ["DeadlineDate", "DeadlineYear", "TeamMergerDeadlineDate", "TeamModelDeadlineDate", "MaxTeamSize", "BanTeamMergers"]
competition_df[var].head(5)

Let's also check missing data.

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(competition_df[var])

We can see that in 74.8% of the cases, **MaxTeamSize** is not set. This means that the team size is not restricted.

Let's show the number of competitions having a certain MaxTeamSize, grouped by year.

In [None]:
tmp = competition_df.groupby('DeadlineYear')['MaxTeamSize'].value_counts()
df = pd.DataFrame(data={'Competitions': tmp.values}, index=tmp.index).reset_index()

In [None]:
# Keep only the competitions that have MaxTeamSize set
d2 = competition_df[pd.notnull(competition_df['MaxTeamSize'])]
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3,figsize=(16,16))
s1 = sns.barplot(ax=ax1, x='DeadlineYear', y='Competitions', hue='MaxTeamSize', data=df)
s1.set_title("Number of competitions with size of max team set per year")
s2 = sns.countplot(x='DeadlineYear', data=d2, ax=ax2)
s2.set_title("Total number of competitions with size of max team set per year")
# Pass ax3 explicitly so the third plot lands on the third subplot
s3 = sns.countplot(x='DeadlineYear', data=competition_df, ax=ax3)
s3.set_title("Total number of competitions per year")
plt.show();

We can observe a few interesting things:    
* In 2017, the number of competitions more than doubled compared with the previous year, 2016;  
* Also in 2017, the number of competitions limiting the number of team members more than doubled compared with 2016; in addition, a very large number of competitions limited teams to a single member (70% of all competitions);   
* In 2018, the number of competitions was already larger (as of **Aug 24**, when the data was updated) than in the whole of 2017; at the same time, the number of competitions with a limited number of team members dropped below the 2016 value.  
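
The year-over-year comparisons listed here can also be checked numerically rather than read off the plots. Below is a minimal sketch on a small made-up frame; `toy_competitions` and its values are invented for illustration and are not Meta Kaggle data. It assumes, as in the rest of this Kernel, that a missing `MaxTeamSize` means no limit.

```python
import numpy as np
import pandas as pd

# Toy stand-in for competition_df; the values are invented for illustration.
toy_competitions = pd.DataFrame({
    "DeadlineYear": [2016, 2016, 2017, 2017, 2017, 2017, 2017],
    "MaxTeamSize": [np.nan, 4.0, 1.0, 1.0, np.nan, 1.0, 2.0],
})

per_year = toy_competitions.groupby("DeadlineYear").agg(
    total=("MaxTeamSize", "size"),     # all competitions, limited or not
    limited=("MaxTeamSize", "count"),  # count() skips NaN, so only limited ones
)

# Ratio of each year to the previous one; a value above 2 means "more than doubled".
growth = per_year / per_year.shift(1)
print(growth)
```

Running the same two steps on `competition_df` would give the actual ratios behind the statements in the list.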



# <a id="3">Teams</a>

## Check the data

Let's now inspect the team and team membership datasets.

In [None]:
teams_df.head(5)

In [None]:
missing_data(teams_df)

We can see that we have over **1M** teams registered. We can merge team data with competition data, since **CompetitionId**, the merge field, has no missing values. Let's now look at the team membership data.

In [None]:
team_membership_df.head(5)

In [None]:
missing_data(team_membership_df)

We can merge team membership data with team data, since **TeamId**, the merge field, has no missing values.
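
The precondition for these merges (no missing merge keys, one team row per team Id) can be checked explicitly. The following is a minimal sketch on made-up frames; `toy_teams` and `toy_memberships` are hypothetical stand-ins that mirror only the `Id`, `CompetitionId` and `TeamId` columns used in this Kernel. It relies on the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for teams_df and team_membership_df (invented values).
toy_teams = pd.DataFrame({"Id": [10, 11, 12], "CompetitionId": [1, 1, 2]})
toy_memberships = pd.DataFrame({"Id": [100, 101, 102, 103],
                                "TeamId": [10, 10, 11, 12]})

# The merge key should have no missing values.
assert toy_memberships["TeamId"].notna().all()

merged = toy_teams.merge(
    toy_memberships,
    left_on="Id", right_on="TeamId",
    how="outer",
    suffixes=("_team", "_membership"),
    validate="one_to_many",  # raises MergeError if a team Id appears twice in toy_teams
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# Rows that exist on only one side of the merge.
unmatched = merged[merged["_merge"] != "both"]
print("Rows without a match:", len(unmatched))
```

An outer merge is used here only so that unmatched rows stay visible; the analysis itself uses inner merges.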

## Teams per year and teams per year and team size

Let's now check the number of teams per year. We will merge Competitions, Teams and Team Membership data.

In [None]:
comp_team_df = competition_df.merge(teams_df, left_on='Id', right_on='CompetitionId', how='inner')
comp_team_membership_df = comp_team_df.merge(team_membership_df, left_on='Id_y', right_on='TeamId', how='inner')

Let's plot the number of teams per year and also the number of teams per year and per number of team members.  
We prepare the dataframe with the number of teams per year and team size.

In [None]:
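# Count membership rows per (year, team): this is the size of each team
tmp = comp_team_membership_df.groupby(['DeadlineYear','TeamId'])['Id'].count()
df = pd.DataFrame(data={'Teams': tmp.values}, index=tmp.index).reset_index()
# Count teams per (year, team size); at this point the 'Teams' column of df holds the team size
tmp = df.groupby(['DeadlineYear','Teams']).count()
df2 = pd.DataFrame(data=tmp.values, index=tmp.index).reset_index()
df2.columns = ['Year', 'Team size','Teams']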
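
To make the two-step grouping easier to follow, here is a minimal sketch of the same idea on a made-up frame with one row per team member. The frame `toy_members` and its values are invented for illustration. It uses `size()` with `reset_index(name=...)`, which is an alternative to the `count()` approach in the cell above, not the code this Kernel runs.

```python
import pandas as pd

# One row per (team, member); invented values.
toy_members = pd.DataFrame({
    "DeadlineYear": [2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "TeamId":       [10,   10,   11,   12,   12,   13,   13],
})

# Step 1: rows per (year, team) = number of members in each team.
team_sizes = (toy_members.groupby(["DeadlineYear", "TeamId"])
              .size()
              .reset_index(name="Team size"))

# Step 2: teams per (year, team size).
teams_per_size = (team_sizes.groupby(["DeadlineYear", "Team size"])
                  .size()
                  .reset_index(name="Teams"))
print(teams_per_size)
```

In this toy data, 2017 has one single-member team and one two-member team, while 2018 has two two-member teams.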

In [None]:
def plot_heatmap_count(data_df, feature1, feature2, color, title):
    # Rows: feature1, columns: feature2, cell values: number of teams.
    # Keyword arguments are used because recent pandas versions no longer accept positional ones in pivot.
    matrix = data_df.pivot(index=feature1, columns=feature2, values='Teams')
    fig, ax1 = plt.subplots(ncols=1, figsize=(16,6))
    sns.heatmap(matrix, 
        xticklabels=matrix.columns,
        yticklabels=matrix.index,ax=ax1,linewidths=.1,linecolor='darkblue',annot=True,cmap=color)
    plt.title(title, fontsize=14)
    plt.show()

Let's now show the number of teams per year.

In [None]:
fig, ax = plt.subplots(nrows=1,figsize=(16,6))
s1 = sns.countplot(x='DeadlineYear', data=comp_team_df, ax=ax)
s1.set_title("Total number of teams per year")
plt.show();

Let's now show the number of teams per year and per number of team members.

In [None]:
plot_heatmap_count(df2,'Team size','Year', 'Blues', "Number of teams grouped by year and by team size")

We can see that large teams were not restricted to 2018. The largest teams were actually in:

* 2012 (40 and 23 team members);  
* 2017 (34 team members);
* 2014 (24, 25 team members);   
* 2013 (24 team members).  

What happens in 2017 and 2018 is a sudden increase in the number of teams (2017) and in the number of medium-sized teams (4-8 team members).

When checking the number of competitions per year, we also notice that in 2018 the competitions without a team size limit increased as a percentage of the total number of competitions. This explains in part the pattern we observed, namely that there are more and more teams (of large size) in 2018. Of course, these findings will have to be revisited after Meta Kaggle is updated with all 2018 data.  
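
The share of competitions without a team size limit per year can be computed directly. This is a minimal sketch on a made-up frame; `toy_competitions` and its values are invented for illustration, and a missing `MaxTeamSize` is assumed to mean no limit.

```python
import numpy as np
import pandas as pd

# Toy stand-in for competition_df; the values are invented for illustration.
toy_competitions = pd.DataFrame({
    "DeadlineYear": [2017, 2017, 2017, 2017, 2018, 2018],
    "MaxTeamSize": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan],
})

# isna() marks competitions without a limit; the mean of that boolean mask
# within each year is the fraction of unrestricted competitions.
no_limit_share = (
    toy_competitions["MaxTeamSize"].isna()
    .groupby(toy_competitions["DeadlineYear"])
    .mean()
    .mul(100)
    .rename("NoLimitPercent")
)
print(no_limit_share)
```

Applied to `competition_df`, the same expression would show whether the unrestricted share really grows in 2018.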


**Note**: This Kernel is still under heavy construction. I would appreciate your feedback and suggestions for improvements.
