# ADCC Fighters EDA and Clustering (KMeans) <a id="section1"></a>

This notebook explores and visualizes data on athletes who competed in the ADCC grappling championship.  
It also uses that data to train an unsupervised model (KMeans) that groups athletes into clusters, with the aim of surfacing trends and relationships that are hard to see in the raw data.  
  
Sports fans tend to be passionate when discussing fighters, and specialized media often craft and push specific narratives to the public. The goal of this notebook is to generate data-based insights that can be compared with what is commonly accepted as true about specific athletes, styles, eras and more. That way, different stories can be told that rely less on subjective perceptions, emotional reactions and personal preferences.

# Contents <a id="section2"></a>
  
- [Intro](#section1)
- [EDA & visualization](#section3)
- [Data preparation for unsupervised clustering](#section4)
- [Clustering](#section5)
- [Matches dataset visualization](#section6)
- [Final message](#section7)


In [1]:

# Importing necessary data manipulation, visualization and machine learning libraries
import numpy as np 
import pandas as pd 
import plotly.express as px
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.graph_objects as go
from sklearn.metrics import silhouette_score




In [2]:
# Importing the data from a CSV file
df = pd.read_csv('/kaggle/input/adcc-fighter-stats/fighters_dataset.csv')
df.head()

Unnamed: 0,name,win_ratio,total_fights,sub_win_ratio,point_win_ratio,decision_win_ratio,n_editions_competed,scored_points_per_fight,suffered_points_per_fight,fights_per_edition,...,highest_match_importance,open_weight_ratio,n_titles,champion,custom_score,n_different_subs,fought_superfight,total_wins,debut_year,female
0,Murilo Santana,0.454545,11.0,0.181818,0.181818,0.090909,7.0,-0.272727,0.0,1.571429,...,3,0.363636,0,0,-0.165699,1,0,5.0,2009,0
1,Nicholas Meregali,0.75,8.0,0.125,0.25,0.25,2.0,0.125,-0.375,4.0,...,4,0.5,0,0,1.97733,1,0,6.0,2022,0
2,Nick Rodriguez,0.7,10.0,0.1,0.4,0.2,4.0,2.3,-0.2,2.5,...,4,0.2,0,0,0.854628,1,0,7.0,2019,0
3,Otavio Sousa,0.625,8.0,0.375,0.125,0.125,4.0,-0.75,0.5,2.0,...,4,0.0,0,0,0.567981,2,0,5.0,2013,0
4,Orlando Sanchez,0.692308,13.0,0.153846,0.153846,0.384615,8.0,-0.384615,-0.461538,1.625,...,4,0.0,1,1,0.424658,2,0,9.0,2013,0


Dataset sample after importing

In [3]:
# General distribution statistics on each feature of the dataset
df.describe()

Unnamed: 0,win_ratio,total_fights,sub_win_ratio,point_win_ratio,decision_win_ratio,n_editions_competed,scored_points_per_fight,suffered_points_per_fight,fights_per_edition,n_weight_classes,...,highest_match_importance,open_weight_ratio,n_titles,champion,custom_score,n_different_subs,fought_superfight,total_wins,debut_year,female
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,...,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,0.196217,3.348534,0.070387,0.099999,0.023665,2.016287,0.095231,0.038001,1.370352,1.058632,...,1.296417,0.111214,0.164495,0.092834,2.314471e-17,0.452769,0.0,1.674267,2010.021173,0.070033
std,0.306604,5.320313,0.171129,0.199468,0.09267,2.278289,0.698188,0.616438,0.786067,0.267577,...,1.563267,0.251079,0.633983,0.290436,1.000815,1.162302,0.0,4.038326,7.627623,0.25541
min,0.0,1.0,0.0,0.0,0.0,1.0,-1.0,-1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,-0.4113964,0.0,0.0,0.0,1998.0,0.0
25%,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,-0.4113964,0.0,0.0,0.0,2003.0,0.0
50%,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,-0.3636218,0.0,0.0,0.0,2011.0,0.0
75%,0.5,3.0,0.0,0.0,0.0,2.0,0.0,0.0,1.553571,1.0,...,2.0,0.0,0.0,0.0,-0.02920021,0.0,0.0,1.0,2017.0,0.0
max,1.0,43.0,1.0,1.0,1.0,15.0,6.8,7.0,13.0,4.0,...,5.0,1.0,6.0,1.0,16.40523,8.0,0.0,30.0,2022.0,1.0


General descriptive statistics on the dataset features

# Exploratory data analysis and visualizations<a id="section3"></a>

In [4]:
# Creating a dataframe to display the body parts that athletes most frequently
# target, or are most frequently targeted on, in submissions

df.loc[df['most_vulnerable'] == 'No specific vulnerability', 'most_vulnerable'] = 'No specific target'
df['most_vulnerable'].value_counts()
# Naming each Series explicitly keeps the column labels stable across pandas versions
targets = pd.concat([df['most_vulnerable'].value_counts().rename('Most frequent target when getting submitted'),
                     df['favorite_target'].value_counts().rename('Most frequent target when submitting')], axis=1)
targets.head()

Unnamed: 0,Most frequent target when getting submitted,Most frequent target when submitting
No specific target,293,491
Neck,162,64
Leg,70,33
Arm,70,22
Other/Unknown,19,4


Number of athletes per preferred target when submitting opponents and per most frequent target when getting submitted.

### Which body parts are most targeted by (and on) athletes for submissions?

In [5]:
fig = px.bar(targets, text_auto=True, width=800, height=600,
             labels={'variable': 'Target'}, hover_name='variable',
            color_discrete_map={"Most frequent target when getting submitted": "indianred",
                                "Most frequent target when submitting": "lightseagreen"})
fig.update_layout(legend=dict(
    orientation="v",
    yanchor="bottom",
    y=0.84,
    xanchor="right",
    x=1,
    bgcolor='rgba(0, 0, 0, 0)'
),
    xaxis_title="Submission target body part", yaxis_title="Number of athletes",
    title_text='Top targets for submitting & getting submitted by athlete',
    title_x=0.47
)



fig.show()

Since ADCC is a highly sought-after, prestigious event for elite grapplers from all over the world, the level of competition is high enough that it's usually hard for them to submit each other.  
Even with a ruleset specifically tweaked to increase submission rates, it's not uncommon for athletes to play defensively, given how much is at stake.  
Because of that, most fighters don't have enough submission data (from ADCC bouts, at least) for us to determine their preferred target when submitting opponents or where they are most vulnerable to being submitted.  
  
Still, there are more leg specialists than arm specialists, but even combined they are not as numerous as the athletes who get most of their submissions through neck attacks.

In [6]:
# Creates a dataframe containing only athletes who fought in finals or superfights
highlvl = df[df['highest_match_importance'] > 3].copy()

### How are win and submission rates distributed among the highest level athletes?
The scatter plot below displays only data from the highest level fighters (those who have competed in finals and/or superfights).

In [7]:
# Plots win ratio against submission ratio
# for athletes who competed in finals or superfights
highlvl.sort_values(by='n_titles', inplace=True, ascending=False)
# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(highlvl, x='win_ratio', y='sub_win_ratio', hover_name="name",
                color=highlvl['n_titles'].astype(str),
                size='total_wins', size_max=40, opacity=0.6,
                color_discrete_sequence=px.colors.sequential.Plasma,
                labels={'color': 'Number of titles'},
                width=800, height=800)


fig.update_layout(
    title={
        'text': "Win ratio vs Submission ratio",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Win ratio", yaxis_title="Submission ratio"

)


# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.54,
        y=0.99,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]

fig.update_layout(annotations=legend_annotations)


fig.show()

This type of chart showcases how impressive some athletes' achievements are, such as Ricardo Arona's undefeated record with 13 total wins and 4 titles.  
The promise Kade Ruotolo showed by winning 4 matches by submission in his first appearance also stands out.

### How does performance vary among different submission specialties?  
***Custom Score*** is an engineered feature calculated using different metrics such as victories/losses at each competition level and number of appearances by the athlete.

In [8]:
# Plots submission ratio against custom score for the highest level athletes

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(highlvl, x='custom_score', y='sub_win_ratio', hover_name="name",
                 color='favorite_target', labels={'favorite_target': 'Most frequent submission target'},
                width=900, height=810, size='total_wins', log_x=True,
                color_discrete_sequence=px.colors.qualitative.Bold, size_max=40)


fig.update_layout(
    title={
        'text': "Submission ratio by custom score",
        'y':0.95,
        'x':0.42,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Custom score", yaxis_title="Submission ratio"

)


# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.52,
        y=0.99,
        xref="paper",
        yref="paper",
        text="(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]

fig.update_layout(annotations=legend_annotations)


fig.show()

This chart clearly displays the difference between those who never got to showcase their preferred submission target at the highest level and those who did.  
It's also possible to see that arm specialists in general display similar results on these metrics, so Fabricio Werdum, Ronaldo Souza, Alexandre Ribeiro and Ffion Davies are packed in close proximity. Kade Ruotolo is an evident outlier from this group, as is Ricardo Arona among the leg attackers.  
Since the neck is the most common preferred target among athletes, neck attackers are a more broadly distributed group here.


### How is the *highest* level of competition an athlete has faced distributed?

In [9]:
# Plots the distribution of the values for most important match fighters competed in

# Generates the chart object by instantiating a plotly express histogram
fig = px.histogram(x=df['highest_match_importance'])
fig.update_layout(
    title={
        'text': "Distribution of highest match importance fighter competed in",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Match importance (0 = other rounds, 4 = finals, 5 = superfights)",
    yaxis_title="Number of athletes"
)

# Calculates the percentage for each value, ordered by importance level
# so the labels line up with the bars from left to right
percentages = df['highest_match_importance'].value_counts(normalize=True).sort_index() * 100

# Adds the percentage text on top of each bar
fig.update_traces(text=percentages.round(2).astype(str) + '%', textposition='auto')

# Modify the tick names in the x-axis
custom_labels = {
'0':'Other',
'1':'Quarterfinals',
'2':'Semifinals',
'3':'3rd place',
'4':'Finals',
'5':'Superfight'
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))


fig.show()

One would expect the bars in this chart to decrease in height from left to right, since more important bouts are fought by athletes who bested others in previous rounds, but that's not what can be seen here.  

Since there's a lot of missing data in the original data source (BJJ Heroes ADCC bout stats), this distribution might be explained by the probable trend that more important bouts are more likely to have data available on the website.

### How is the *average* level of competition an athlete has faced distributed?

In [10]:
# Plots the distribution of the values for average match importance fighters competed in

# Generates the chart object by instantiating a plotly express histogram
fig = px.histogram(df, x='avg_match_importance')
fig.update_layout(
    title={
        'text': "Distribution of average match importance fighter competed in",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Match importance (0 = other rounds, 4 = finals, 5 = superfights)",
    yaxis_title="Number of athletes"
)

# Calculates the percentage for each value
total_count = len(df['avg_match_importance'])
percentages = df['avg_match_importance'].value_counts(normalize=True) * 100

# Modify the tick names in the x-axis
custom_labels = {
'0':'Other',
'1':'Quarterfinals',
'2':'Semifinals',
'3':'3rd place',
'4':'Finals',
'5':'Superfight'
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))

# Add the percentage text on top of each bar
fig.update_traces(text=percentages.round(2).astype(str) + '%', textposition='auto')

fig.show()

### How is the total number of wins distributed among athletes?

In [11]:
# Plots the distribution of total wins for athletes

# Generates the chart object by instantiating a plotly express histogram
fig = px.histogram(df, x='total_wins')
fig.update_layout(
    title={
        'text': "Distribution of total matches the athlete won",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Total matches won by athlete",
    yaxis_title="Number of athletes"
)

# Calculates the percentage for each value
total_count = len(df['total_wins'])
percentages = df['total_wins'].value_counts(normalize=True) * 100

# Add the percentage text on top of each bar
fig.update_traces(text=percentages.round(2).astype(str) + '%', textposition='auto')

fig.show()

This is closer to what one might expect from such a graph. The distribution follows the logic that success in competitive sports can be visualized as a pyramid, with a few highly successful athletes at the top and many less successful ones at the bottom.

### How many different submissions have athletes managed to use to end matches?

In [12]:
# Plots the distribution of total number of different submissions athletes have performed

# Generates the chart object by instantiating a plotly express histogram
fig = px.histogram(df, x='n_different_subs', hover_name="n_different_subs")
fig.update_layout(
    title={
        'text': "Distribution of number of different submissions",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Amount of different submissions scored by the athlete",
    yaxis_title="Number of athletes"
)

# Calculates the percentage for each value
total_count = len(df['n_different_subs'])
percentages = df['n_different_subs'].value_counts(normalize=True) * 100

# Add the percentage text on top of each bar
fig.update_traces(text=percentages.round(2).astype(str) + '%', textposition='auto')

fig.show()

The 0 bar shows that almost 80% of ADCC competitors have never submitted an opponent in the event.  
It's important to note that, to reach a higher value in this feature, an athlete must not only perform with a higher technical level and adaptability but also earn the opportunity to fight more matches.

### How does the average of points per match (scored and conceded) differ between athletes?  
It's worth *pointing* out that ADCC is notorious for being a competition that encourages submissions over point scoring.

In [13]:
# Plots average points scored against average points conceded by fighters

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='suffered_points_per_fight', y='scored_points_per_fight', hover_name="name",
                color=df['female'].astype(str), labels={'color': 'Sex'})


fig.update_layout(
    title={
        'text': "Average points Scored Vs. Average points Conceded",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Average points conceded per fight", yaxis_title="Average points scored per fight"

)

# Renames the legend labels for understandability
newnames = {'0':'Male', '1': 'Female'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name]))


fig.show()

Paulo Miyao, Royler Gracie and Ffion Davies stand out with great points differentials (their points scored compared with their opponents').

### Is weight class related to the target body part of successful submissions?

In [14]:

# Calculate the counts and proportions for each category in 'most_vulnerable' within each 'main_weight_class'
df_grouped = df.groupby(['main_weight_class', 'most_vulnerable']).size().reset_index(name='count')
df_grouped['proportion'] = df_grouped.groupby('main_weight_class')['count'].transform(lambda x: x / x.sum())
df_grouped = df_grouped[df_grouped['most_vulnerable'] != 'No specific target']

# Create a histogram chart using Plotly Express
fig = px.histogram(df_grouped, x='main_weight_class', y='proportion',
                   color='most_vulnerable', barmode='group',
                  width=900, height=600, hover_name='most_vulnerable'
                  )

# Modify the tick names in the x-axis
custom_labels = {
'0':'66kg (60kg for females)',
'1':'77kg',
'2':'88kg',
'3':'99kg',
'4':'+99kg (+60kg for females)',
}
fig.update_xaxes(ticktext=list(custom_labels.values()), tickvals=list(custom_labels.keys()))



# Update the layout of the chart
fig.update_layout(title='Target body part of successful Submissions by Weight Class',
                  xaxis_title='Main Weight Class',
                  yaxis_title='Submission probability for body part',
                  legend_title='Submission target',
                 legend=dict(x=0.75, y=0.99,bgcolor='rgba(0, 0, 0, 0)')
                 )

fig.update_layout(title_x=0.47)

fig.show()

The plot above shows that heavier athletes are less likely to submit opponents using neck attacks. Despite having lower submission rates in general, these athletes tend to have higher submission rates for legs and arms specifically.

### How are win rate and average match importance related?

In [15]:
# Plots win ratio against average match importance

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='avg_match_importance', y='win_ratio', hover_name="name",
                 color=df['champion'].astype(str), labels={'color': 'Has title?'},
                size='total_wins', size_max=30, log_x=True,
                width=900, height=810)


fig.update_layout(
    title={
        'text': "Win ratio by average match importance for fighters",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Average match importance (0 = other rounds, 4 = finals, 5 = superfights)", yaxis_title="Win ratio"

)

# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.546,
        y=1,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]
fig.update_layout(annotations=legend_annotations)

# Renames the legend labels for understandability
newnames = {'0':'No', '1': 'Title winner'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name]))


fig.show()

It can be seen that the average match importance for fighters is not a good feature to differentiate them, since they all appear closely packed together in the above chart.  
Still, setting the X-axis scale to logarithmic softens this effect.  
The trend observed here (of positive correlation between match importance and average win rate) is to be expected since fighters advance to later stages (matches with higher importance) by winning matches with less importance.


### How are winners and champions from each generation different?

In [16]:
# Plots total wins against year of debut (first appearance)

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='debut_year', y='total_wins', hover_name="name",
                 color=df['champion'].astype(str), labels={'color': 'Has title?'},
                size='sub_win_ratio')


fig.update_layout(
    title={
        'text': "Total wins by debut year",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Debut year (edition of first appearance)", yaxis_title="Total wins"

)

# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.546,
        y=1.04,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to submission rate)",
        showarrow=False,
        font=dict(size=12),
    )]

# Edits legend label for understandability
newnames = {'0':'No', '1': 'Title winner'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name]))

fig.update_layout(annotations=legend_annotations)


fig.show()

It's worth noting that ADCC has never experienced a "talent drought" or shortage of capable fresh athletes, with fighters from all the different generations going on to achieve impressive and/or unprecedented feats.  
*Alexandre Ribeiro* had the most total wins among athletes who debuted in the first decade of the competition (1998-2008), *Andre Galvao* in the second decade (2008-2018) and *Gordon Ryan* in the third and current decade (2018-present).

### How are win and submission rates related?

In [17]:
# Plots win ratio against submission ratio for all athletes

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='win_ratio', y='sub_win_ratio', hover_name="name",
                 color=df['champion'].astype(str), labels={'color': 'Has title?'},
                size='total_wins', width=800, height=800)


fig.update_layout(
    title={
        'text': "Win ratio vs Submission ratio",
        'y':0.95,
        'x':0.48,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Win ratio", yaxis_title="Submission ratio"

)

# Edit legend labels for understandability
newnames = {'0':'No', '1': 'Title winner'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name]))

# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.52,
        y=1,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]

fig.update_layout(annotations=legend_annotations)


fig.show()

Marcelo Garcia, Gordon Ryan and Roger Gracie stand out as usual, with both submission and win rates considerably higher than their peers'.

### How is performance related to total number of titles?

In [18]:
# Plots submission ratio against custom score, colored by number of titles

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='custom_score', y='sub_win_ratio', hover_name="name",
                 color=df['n_titles'].astype(str), labels={'color': 'Number of titles'},
                width=800, height=800, size='total_wins', log_x=True,
                color_discrete_sequence=px.colors.sequential.Sunsetdark, size_max=40)


fig.update_layout(
    title={
        'text': "Submission ratio by custom score",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Custom score", yaxis_title="Submission ratio"

)


# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.542,
        y=1,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]

fig.update_layout(annotations=legend_annotations)


fig.show()

The plot above highlights some of the most memorable athletes who were never champions such as Rousimar Palhares and Craig Jones, and also Joao Miyao and Ricardo Almeida.

### Do athletes from different generations prefer different targets? Does that influence ***win*** rates?

In [19]:
# Plots the distribution between win rate and year of debut (first appearance)


# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='debut_year', y='win_ratio', hover_name="name",
                 color=df['favorite_target'].astype(str),
                size='total_wins', size_max=30, opacity=0.6, labels={'color': 'Favorite target'})


fig.update_layout(
    title={
        'text': "Win rate by debut year",
        'y':0.95,
        'x':0.47,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Year of first appearance", yaxis_title="Win rate"

)

# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.53,
        y=1.04,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]

fig.update_layout(annotations=legend_annotations)


fig.show()

### Do athletes from different generations prefer different targets? Does that influence ***submission*** rates?

In [20]:
# Plots the distribution between submission rate and year of debut (first appearance)

# Generates the chart object by instantiating a plotly express scatter plot
fig = px.scatter(df, x='debut_year', y='sub_win_ratio', hover_name="name",
                 color=df['favorite_target'].astype(str),
                size='total_wins', size_max=30, opacity=0.6, labels={'color': 'Favorite target'})


fig.update_layout(
    title={
        'text': "Submission rate by debut year",
        'y':0.95,
        'x':0.47,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Year of first appearance", yaxis_title="Submission rate"

)

# Display observation under title explaining bubble size
legend_annotations = [
    dict(
        x=0.53,
        y=1.04,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]

fig.update_layout(annotations=legend_annotations)


fig.show()

This plot highlights the impressive achievements of a different set of athletes than the previous ones, since the comparison is made among athletes of the same "generation". Examples are Jean Jacques Machado, Dean Lister, Kade Ruotolo, Kron Gracie and Rousimar Palhares.  
Bianca Mesquita and Ana Carolina Vieira are also considerably ahead of their peers.

# Data preparation for unsupervised clustering<a id="section4"></a>

In [21]:
# Generates dummy columns for non numeric features, using pandas native
# implementation of one-hot encoding

dummies = pd.get_dummies(df[['favorite_target', 'most_vulnerable']], dtype=float)

# Creates a new dataframe
sdf = pd.concat([df, dummies], axis=1)
sdf.drop(columns=['name', 'favorite_target', 'most_vulnerable', 'main_weight_class'], inplace=True)
sdf.head()

Unnamed: 0,win_ratio,total_fights,sub_win_ratio,point_win_ratio,decision_win_ratio,n_editions_competed,scored_points_per_fight,suffered_points_per_fight,fights_per_edition,n_weight_classes,...,favorite_target_Arm,favorite_target_Leg,favorite_target_Neck,favorite_target_No specific target,favorite_target_Other/Unknown,most_vulnerable_Arm,most_vulnerable_Leg,most_vulnerable_Neck,most_vulnerable_No specific target,most_vulnerable_Other/Unknown
0,0.454545,11.0,0.181818,0.181818,0.090909,7.0,-0.272727,0.0,1.571429,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.75,8.0,0.125,0.25,0.25,2.0,0.125,-0.375,4.0,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.7,10.0,0.1,0.4,0.2,4.0,2.3,-0.2,2.5,1,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.625,8.0,0.375,0.125,0.125,4.0,-0.75,0.5,2.0,1,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.692308,13.0,0.153846,0.153846,0.384615,8.0,-0.384615,-0.461538,1.625,1,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
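
As a rough illustration of what one-hot encoding produces, the sketch below builds the dummy columns by hand for a toy list of targets. The helper name `one_hot` and the sample values are hypothetical; the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(values, prefix):
    """Return a dict mapping '<prefix>_<category>' to a list of 0.0/1.0 flags."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1.0 if value == category else 0.0 for value in values]
        for category in categories
    }


# Hypothetical sample: favorite targets of four athletes
sample_targets = ["Neck", "Arm", "Neck", "Leg"]
encoded = one_hot(sample_targets, prefix="favorite_target")

for column, flags in encoded.items():
    print(column, flags)
```

Each athlete ends up with exactly one flag set to 1.0 per encoded feature, which is what lets a categorical value take part in distance calculations later on.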


In [22]:
# Scales the data to the 0,1 range to prevent distortions and improve training

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(sdf)
sdf = pd.DataFrame(scaled_data, columns=sdf.columns)
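
For reference, min-max scaling maps each column to the 0,1 range with `(x - min) / (max - min)`. The snippet below is a minimal NumPy sketch of that formula on made-up values; `min_max_scale` and `toy` are illustrative names, not part of the notebook, and constant columns are mapped to 0 here as an assumption.

```python
import numpy as np


def min_max_scale(X):
    """Scale each column of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: a constant column has zero range, so divide by 1 and map it to 0
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range


# Made-up values on very different scales (e.g. fights and debut year)
toy = np.array([[1.0, 1998.0],
                [11.0, 2010.0],
                [43.0, 2022.0]])
scaled_toy = min_max_scale(toy)
print(scaled_toy)
```

Without this step, a feature measured in years would dominate the Euclidean distances used by KMeans simply because its raw values are larger.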

**Principal Component Analysis (PCA)** is a technique commonly used in data problems to simplify complex datasets by **reducing their dimensionality**.  

The goal of **PCA** is to do so while capturing the **maximum amount of information** from the original dataset and **minimizing the loss of important patterns, trends, etc**.
By reducing the dimensionality, data can be better visualized and interpreted, and the use of some machine learning techniques can be simplified.  

**PCA** works by calculating the **eigenvectors and eigenvalues** of the **covariance matrix** of the dataset. The eigenvectors give the "directions" of maximum variance among the data points, which in turn define new axes (artificial features calculated from the original ones) called **Principal Components**, onto which the **data points are projected**.
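
The eigendecomposition described above can be sketched in a few lines of NumPy. This is only an illustration on random toy data, not the implementation scikit-learn uses internally; the function name `pca_via_covariance` and the toy array are assumptions made for the example.

```python
import numpy as np


def pca_via_covariance(X, n_components):
    """Project X onto its top principal components via the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)           # PCA works on mean-centered data
    cov = np.cov(X_centered, rowvar=False)    # features x features covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]    # directions of maximum variance
    projected = X_centered @ components       # data points in the new axes
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return projected, explained_ratio


# Toy data: five features with deliberately different spreads
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
projected, explained_ratio = pca_via_covariance(toy, n_components=3)
print(projected.shape, explained_ratio)
```

The `explained_ratio` values play the same role as `explained_variance_ratio_` in the next cell: each one is the share of total variance captured by a component.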

In [23]:
# Creates three new features based on transformations from the original ones
# Principal component analysis (PCA) minimizes loss of variation when reducing dimensions

three_pca = PCA(n_components=3)
three_pca_data = three_pca.fit_transform(sdf)

explained_var = three_pca.explained_variance_ratio_

# Displays the amount of variation each principal component accounts for
for i, ratio in enumerate(explained_var):
    print(f"Explained Variance Ratio for Component {i+1}: {ratio:.4f}")

Explained Variance Ratio for Component 1: 0.2741
Explained Variance Ratio for Component 2: 0.2040
Explained Variance Ratio for Component 3: 0.0978


The results above indicate how much of the dataset's variance can be accounted for by each of the Principal Components generated.

# Clustering <a id="section5"></a>

***K-Means*** is a popular algorithm used for **unsupervised** learning in data science. Its purpose is to partition a dataset into distinct groups (called **clusters**) based on the similarity of data points. These similarities are not always obvious when dealing only with traditional statistical analysis.  

The algorithm works in successive iterations to **minimize within-cluster (intra-cluster) variance** while maximizing inter-cluster variance.  


The **initialization** of K-Means consists of **randomly selecting K cluster centroids**. **K** is its **main parameter**, which must be chosen in a separate analysis.  

At each iteration of **K-Means**, data points are assigned to the cluster of their **nearest centroid**. The most common distance metric for this is Euclidean distance. This assignment is followed by an **update of the centroid positions**, which are calculated as the **means of the cluster members**.
After the centroids are updated, another iteration of the **assignment-update loop** begins. This process is repeated until the centroid positions stop changing or a previously specified number of iterations is reached.  

It's worth noting that the results of **K-Means** are sensitive to the **randomly initialized** centroid positions. To mitigate this, it's important to repeat the process several times with ***different starting conditions***.

**WCSS** (***Within-Cluster Sum of Squares***) is a common metric to statistically evaluate KMeans. It's the sum of squared distances between each data point and the centroid of its cluster. By minimizing WCSS, KMeans finds clusters whose elements are 'closer' to each other.
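
The assignment-update loop and the WCSS metric can be sketched as follows. This is a simplified illustration with plain random initialization and a single run, whereas the cells below use scikit-learn's `KMeans` with `k-means++` seeding and `n_init=10`; `kmeans_sketch` and the toy blobs are hypothetical.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means: random initial centroids, then assignment/update steps."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment against the final centroids, then WCSS
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss


# Toy data: two well separated blobs
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
toy = np.vstack([blob_a, blob_b])

labels, centroids, wcss = kmeans_sketch(toy, k=2)
print(centroids, wcss)
```

The `wcss` value computed here corresponds to what scikit-learn exposes as `inertia_`.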


In [24]:
# Tests several values for the amount of clusters generated
# Plots the WCSS for each of these clusters number

wcss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters = i, init = "k-means++", max_iter = 500, n_init = 10, random_state = 123)
    kmeans.fit(three_pca_data)
    wcss.append(kmeans.inertia_)
    
# x must match the values of n_clusters tested in the loop above (1 to 9)
fig = go.Figure(data = go.Scatter(x = list(range(1, 10)), y = wcss))


fig.update_layout(title='WCSS vs. Number of clusters',
                   xaxis_title='Clusters',
                   yaxis_title='WCSS')

fig.update_layout(title_x=0.49)

fig.show()

Since there are no easily identifiable "*elbows*" in the chart above, another metric is used to determine the optimal number of clusters.

The **Silhouette** score represents how well data points are 'fitted' to their cluster. It ranges from -1 to 1, and negative values indicate elements that are probably in the 'wrong' cluster. For each point, it compares the average distance to the other members of its own cluster with the average distance to the members of the nearest other cluster. The more compact and well separated the clusters are, the higher the score.
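
As a rough illustration of the definition, the sketch below computes the per-point silhouette value, (b - a) / max(a, b), by hand on four made-up 1D points. Here `a` is the mean distance to the other members of the point's own cluster and `b` is the smallest mean distance to the members of any other cluster. The function name `silhouette_values` and the toy data are assumptions; the notebook itself relies on `sklearn.metrics.silhouette_score`, which reports the average of these values.

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-point silhouette value: (b - a) / max(a, b)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a point alone in its cluster gets a score of 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Made-up 1D points: two obvious groups around 0 and around 10
points = np.array([[0.0], [1.0], [10.0], [11.0]])

good = silhouette_values(points, np.array([0, 0, 1, 1]))  # groups match the gap
bad = silhouette_values(points, np.array([0, 1, 0, 1]))   # groups straddle the gap

print(good.mean())  # close to 1
print(bad.mean())   # negative
```

The second labeling puts every point closer, on average, to the other cluster than to its own, which is exactly the situation negative values flag.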

In [25]:
# Tests several numbers of clusters and scores them using the silhouette score metric.
# Plots the silhouette score for different numbers of clusters

silhouette_scores = []

for i in range(2, 10):
    kmeans = KMeans(n_clusters=i, init="k-means++", max_iter=500, n_init=10, random_state=123)
    kmeans.fit(three_pca_data)
    labels = kmeans.labels_
    silhouette_avg = silhouette_score(three_pca_data, labels)
    silhouette_scores.append(silhouette_avg)


fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(2, 10)), y=silhouette_scores, mode='lines', name='Silhouette Score'))

fig.update_layout(
    title='Silhouette Score vs. Number of clusters',
    xaxis_title='Number of Clusters',
    yaxis_title='Score'
)

fig.update_layout(title_x=0.49)

fig.show()

Since the maximum Silhouette score was found for ***n_clusters = 6***, that's the number of clusters used to train and use the model.
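
Instead of reading the best value off the chart, the number of clusters can also be picked programmatically. The sketch below uses placeholder scores (`example_scores` is made up and does not reproduce this notebook's actual values); with the real data, the `silhouette_scores` list built above would take its place.

```python
# Candidate numbers of clusters, matching range(2, 10) used in the loop above
candidate_ks = list(range(2, 10))

# Placeholder silhouette scores, one per candidate k (illustrative values only)
example_scores = [0.31, 0.33, 0.35, 0.36, 0.40, 0.37, 0.34, 0.32]

# Pick the k whose score is the highest
best_k = candidate_ks[example_scores.index(max(example_scores))]
print(best_k)  # 6
```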

In [26]:
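# Trains the model by fitting it to the data
# (same random_state as in the silhouette analysis, so the clusters are reproducible)
kmeans = KMeans(n_clusters = 6, init="k-means++", max_iter = 500, n_init = 10, random_state = 123)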
identified_clusters = kmeans.fit_predict(three_pca_data)

# Creates a dataframe containing the cluster label
data_with_clusters = df.copy()
data_with_clusters['Cluster'] = identified_clusters

In [27]:
three_pca_df = pd.DataFrame(
    data=three_pca_data, 
    columns=['PC1', 'PC2', 'PC3'])

# Adds the artificial features generated by PCA to the new dataframe
for col in three_pca_df.columns:
    data_with_clusters[col] = three_pca_df[col]

### How are the clusters distributed along the 3 axes generated by PCA?

In [28]:
# Plots the points (athletes) on the 3D space defined by the 3 PCA features

fig = px.scatter_3d(data_with_clusters, x='PC1', y='PC2', z='PC3', labels={'color': 'Cluster'},
                    hover_name="name", opacity=0.8, size_max=40, width=800, height=800,
                   color=data_with_clusters['Cluster'].astype(str),
                    color_discrete_sequence=px.colors.qualitative.Dark2, 
                   )
fig.show()

The graph above allows for clear visualization of how athletes are distributed along the PCA features. These three components define the 3D projection that retains the most variance of the dataset, which makes the clusters easier to see.  
This kind of visualization, with such clear separation between clusters, is made possible by PCA.

### How are athletes distributed among the KMeans clusters?

In [29]:
# Calculate the number of athletes in each cluster
cluster_counts = data_with_clusters['Cluster'].value_counts()

total_athletes = len(data_with_clusters)

# Calculate the percentage of total for each cluster
percentage_total = cluster_counts / total_athletes * 100

# Sort cluster counts and percentages by cluster label, in descending order
cluster_counts = cluster_counts.sort_index(ascending=False)
percentage_total = percentage_total.sort_index(ascending=False)

fig = px.bar(x=cluster_counts.index, y=cluster_counts.values,
             color=cluster_counts.index.astype(str),
             color_discrete_sequence=px.colors.qualitative.Dark2)

fig.update_traces(hovertemplate='Cluster %{x}: %{y} Athletes<br>Total: %{y}')
# Add percentage of total as static text inside each bar
for x, y, text in zip(cluster_counts.index, cluster_counts.values, percentage_total.values):
    fig.add_annotation(
        x=x,
        y=y,
        text=f'{text:.2f}%',
        showarrow=False,
        font=dict(color='black', size=12),
        yshift=10
    )

# Customize the chart layout
fig.update_layout(
    title='Number of athletes in each cluster',
    xaxis_title='Cluster',
    yaxis_title='Number of athletes',
    showlegend=False
)

fig.update_layout(title_x=0.49)



fig.show()

The bar plot above shows that the clusters are not balanced, with some clusters vastly overrepresented and others underrepresented.

### How do athletes from different clusters perform?

In [30]:
# Plots the distribution between win ratio and match importance for different clusters
data_with_clusters.sort_values(by='Cluster', inplace=True)
fig = px.scatter(data_with_clusters, x='avg_match_importance', y='win_ratio', hover_name="name",
                 color=data_with_clusters['Cluster'].astype(str), labels={'color': 'Cluster'},
                size='total_wins', size_max=40, opacity=0.6)


fig.update_layout(
    title={
        'text': "Win ratio by average match importance for fighters",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title="Average match importance (from early rounds up to superfights)", yaxis_title="Win ratio"

)

legend_annotations = [
    dict(
        x=0.546,
        y=1.04,
        xref="paper",
        yref="paper",
        text=f"(Bubble size proportional to total wins)",
        showarrow=False,
        font=dict(size=12),
    )]
fig.update_layout(annotations=legend_annotations)

# Modify the tick names in the x-axis
custom_labels = {
'0':'Other',
'1':'Quarterfinals',
'2':'Semifinals',
'3':'3rd place',
'4':'Finals',
'5':'Superfight'
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))


fig.show()

In [31]:

# Calculate the sum of 'n_titles' for each cluster
sum_n_titles = data_with_clusters.groupby('Cluster')['n_titles'].sum().reset_index()


# Add missing clusters with sum of n_titles equal to 0
all_clusters = data_with_clusters['Cluster'].unique()
missing_clusters = list(set(all_clusters) - set(sum_n_titles['Cluster']))

missing_data = pd.DataFrame({'Cluster': missing_clusters, 'n_titles': 0})
sum_n_titles = pd.concat([sum_n_titles, missing_data])

# Sort the DataFrame by cluster values
sum_n_titles = sum_n_titles.sort_values('Cluster')

fig = px.bar(sum_n_titles, x='Cluster', y='n_titles', color=sum_n_titles['Cluster'].astype(str),
             color_discrete_sequence=px.colors.qualitative.Dark2,
            hover_name=sum_n_titles['n_titles'])


# Customize the chart layout
fig.update_layout(
    title='Total of titles in each cluster',
    xaxis_title='Cluster',
    yaxis_title='Number of titles',
    showlegend=False
)


# Hide redundant legend since it displays the same information as x axis
fig.update_layout(showlegend=False)
fig.update_layout(title_x=0.49)



fig.show()

It can be seen above that 3 of the clusters have a total of 0 titles combined. Further investigation could determine which (if any) specific fighter stats are strongly related to this and thus to title winning in general.
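
One possible starting point for that investigation is a per-cluster profile of average stats next to the title count. The sketch below uses a small made-up table (`toy`) that only borrows column names from the fighters dataset; with the real data, `data_with_clusters` would take its place.

```python
import pandas as pd

# Made-up rows reusing column names from the fighters dataset (values are illustrative)
toy = pd.DataFrame({
    'Cluster':       [0, 0, 1, 1, 2, 2],
    'win_ratio':     [0.80, 0.70, 0.40, 0.50, 0.10, 0.20],
    'sub_win_ratio': [0.50, 0.30, 0.20, 0.20, 0.00, 0.10],
    'n_titles':      [2, 1, 0, 0, 0, 0],
})

# Average stats per cluster, with the total number of titles alongside
profile = toy.groupby('Cluster').agg(
    win_ratio=('win_ratio', 'mean'),
    sub_win_ratio=('sub_win_ratio', 'mean'),
    total_titles=('n_titles', 'sum'),
)

# Clusters without any title, to compare against the ones that have them
titleless = profile.index[profile['total_titles'] == 0].tolist()

print(profile)
print(titleless)  # [1, 2]
```

Comparing the rows of `profile` for titled and titleless clusters gives a first hint of which stats separate them.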

# Matches dataset visualization<a id="section6"></a>

Here another dataset crafted from the same original data source is imported. This data is organized differently, with each recorded match being a row and features being stats on individual matches.  

It's available at https://www.kaggle.com/datasets/albucathecoder/adcc-matches  

From this point forward the analysis and visualizations will be focused on this dataset.

In [32]:
mdf = pd.read_csv('/kaggle/input/adcc-matches/adcc_matches.csv')
mdf.head()

Unnamed: 0,victory_method,submission,winner_points,loser_points,female,year,absolute,weight_class,importance,total_points,submission_target,winner_name,loser_name
0,DECISION,,-1,-1,0,2011,1,,1,2,,Murilo Santana,Vinicius Magalhaes
1,SUBMISSION,Kimura,-1,-1,0,2022,0,3.0,0,2,Arm,Nicholas Meregali,Henrique Cardoso
2,DECISION,,-1,-1,0,2022,0,3.0,1,2,,Nicholas Meregali,Yuri Simoes
3,POINTS,,0,0,0,2022,0,3.0,3,0,,Nicholas Meregali,Rafael Lovato Jr
4,POINTS,,6,2,0,2022,1,,1,8,,Nicholas Meregali,Giancarlo Bodoni


### Does the way matches are won change over the years?

In [33]:
# Calculate likelihood for each victory type
likelihood_mdf = mdf.groupby(['year', 'victory_method']).size().div(mdf.groupby('year').size(), level='year').reset_index(name='likelihood')
likelihood_mdf['likelihood'] *= 100

# Plot lines using plotly express
fig = px.line(likelihood_mdf, x='year', y='likelihood', color='victory_method',
              hover_name='victory_method',
              color_discrete_sequence=px.colors.qualitative.Vivid)


fig.update_yaxes(title_text='Victory method (%)')
fig.update_xaxes(title_text='Year')

fig.update_layout(
    legend=dict(x=0.75, y=0.99,bgcolor='rgba(0, 0, 0, 0)'),
    legend_title_text="Win by", title_x=0.49,
    title_text='Probability of each victory type over the years'

)


fig.show()

Only the 1998, 2005 and 2007 editions had most of their matches end by submission, the outcome commonly associated with the most exciting matches to watch.  
Decision victories (when there's no submission and no points difference between the fighters) surged from 2011 to 2015 and have since remained stable at a considerable likelihood level.  
The 2000 edition stands out, with more than 70% of matches being decided by points.
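
The percentages behind this chart can also be obtained with `pd.crosstab` and `normalize='index'`, which is arguably easier to read than the `groupby(...).size().div(...)` chain. The sketch below runs on a tiny made-up table (`toy_matches`), not on the real matches dataset. A side effect of this approach is that a victory method that never happened in a given year shows up as 0 instead of being absent.

```python
import pandas as pd

# Tiny made-up sample using the same column names as the matches dataset
toy_matches = pd.DataFrame({
    'year': [1998, 1998, 1998, 1998, 2000, 2000],
    'victory_method': ['SUBMISSION', 'SUBMISSION', 'SUBMISSION', 'POINTS',
                       'POINTS', 'DECISION'],
})

# Share (%) of each victory method within each year: every row sums to 100
share = pd.crosstab(toy_matches['year'], toy_matches['victory_method'],
                    normalize='index') * 100

# Back to long format (one row per year and victory method), the shape px.line expects
long_share = share.reset_index().melt(id_vars='year', var_name='victory_method',
                                      value_name='likelihood')

print(share)
print(long_share)
```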

In [34]:
# Calculate likelihood for each category over the years and convert to percentages
likelihood_mdf = mdf.groupby(['year', 'victory_method']).size().div(mdf.groupby('year').size(), level='year').reset_index(name='likelihood')
likelihood_mdf['likelihood'] *= 100

# Create stacked bar chart using plotly express
fig = px.bar(likelihood_mdf, x='year', y='likelihood', color='victory_method',
            labels={'year': 'Year', 'likelihood': 'Probability (%)', 'victory_method': 'Victory Method'},
            title='Proportion of Victory Methods over the Years',
            hover_data={'year': True, 'likelihood': ':.2f', 'victory_method': True},
            barmode='stack',
            color_discrete_sequence=px.colors.qualitative.Bold
            )

fig.update_layout(title_x=0.49)


fig.show()

The plot above is an alternative way of visualizing the same information as the previous one.  
It shows that up to 2011, the proportions of submission and points victories were going back and forth against one another.  
With the rise of decision victories, the picture got more complex.

### Do athletes' targets for submissions change over the years?

In [35]:
# Calculate likelihood for each category over the years and convert to percentages
likelihood_mdf = mdf.groupby(['year', 'submission_target']).size().div(mdf.groupby('year').size(), level='year').reset_index(name='likelihood')
likelihood_mdf['likelihood'] *= 100

# Create line chart using plotly express
fig = px.line(likelihood_mdf, x='year', y='likelihood', color='submission_target',
              hover_name='submission_target')


fig.update_yaxes(title_text='Submission ratio (%)')
fig.update_xaxes(title_text='Year')

fig.update_layout(
    legend=dict(x=0.75, y=0.99,bgcolor='rgba(0, 0, 0, 0)'),
    legend_title_text="Target", title_x=0.47,
    title_text='Submission ratio for each target body part over the years'
)


fig.show()

Since the 1998 ADCC, arm submissions have become rarer.  
With the exception of that year and 2011, neck attacks have always been the most common way to end matches prematurely.  
Over the years there have been wild changes in the way leg attacks are used in submission grappling.  
As one group of fighters popularizes certain systems of leg attack, the general athlete pool picks up on the new trends and the playing field eventually gets levelled.  
  
  
In a way, the evolution of submission attacks over subsequent ADCC editions can be seen as an arms race of sorts, with constant development of techniques and systems to specifically counter what has recently been successful.

### How did the popularity of heel hooks change?

In [36]:
# Get only matches that ended by heel hook
heel_hook_mdf = mdf[mdf['submission'].isin(['Inside heel hook', 'Heel hook', 'Outside heel hook'])]

# Group the DataFrame to calculate the count of heel hook submissions by year
grouped_mdf = heel_hook_mdf.groupby(['year']).size().reset_index(name='count')

# Create the plot using Plotly Express
fig = px.bar(grouped_mdf, x='year', y='count',
             labels={'year': 'Year', 'count': 'Number of Heel Hooks'},
             title='Number of Heel Hook submissions over the Years',
             hover_data={'year': True, 'count': True})

fig.update_layout(title_x=0.49)

fig.show()

Heel hooks have transitioned from a rare occurrence to one of the most used leg attacks in submission grappling, with a surge in 2011 followed by their popularization among the athlete pool.

In [37]:
# Get only matches that ended by kneebar
kneebar_mdf = mdf[mdf['submission'] == 'Kneebar']

# Group the DataFrame to calculate the count of kneebar submissions by year
grouped_mdf = kneebar_mdf.groupby(['year']).size().reset_index(name='count')

# Create the plot using Plotly Express
fig = px.bar(grouped_mdf, x='year', y='count',
             labels={'year': 'Year', 'count': 'Number of Kneebars'},
             title='Number of Kneebar submissions over the Years',
             hover_data={'year': True, 'count': True})

fig.update_layout(title_x=0.49)


fig.show()

Kneebars, on the other hand, have become rarer since 2001.

### Who are the most accomplished submission artists in each ADCC edition?

In [38]:
# Get only matches that ended by submission
submissions_mdf = mdf[mdf['victory_method'] == 'SUBMISSION']


# Group data to calculate the count of submissions for each fighter and year
grouped_mdf = submissions_mdf.groupby(['year', 'winner_name']).size().reset_index(name='count')

# Find the fighter with the most submission wins in each year (ties keep the first fighter listed)
max_submissions = grouped_mdf.loc[grouped_mdf.groupby('year')['count'].idxmax()].reset_index(drop=True)

# Plot the horizontal bars
fig = go.Figure(data=go.Bar(
    y=max_submissions['year'],
    x=max_submissions['count'],
    text=max_submissions['winner_name'],
    hovertemplate=
    '<b>Year</b>: %{y}<br>' +
    '<b>Fighter</b>: %{text}<br>' +
    '<b>Submissions</b>: %{x}<extra></extra>',
    orientation='h',
    textposition='inside',  # Set text position inside the bars
    textfont={'size': 14},  # Adjust the text font size
    marker_color='steelblue',  # Customize the bar color
))


fig.update_layout(
    title='Fighter with Most Submission Wins in Each Year',
    xaxis_title='Number of Submissions',
    yaxis_title='Year',
    height = 800,
    title_x=0.48,
)


fig.show()

### How do submission rates vary between different stages of the competition?

In [39]:
# Get only matches that ended in submission
submission_mdf = mdf[mdf['victory_method'] == 'SUBMISSION']

# Calculate submission ratio for each 'importance' value
submission_ratio = submission_mdf.groupby('importance').size() / mdf.groupby('importance').size()
submission_ratio *= 100  # Convert to percentage

# Create a DataFrame with 'importance' and 'submission_ratio' columns
submission_ratio_mdf = pd.DataFrame({'importance': submission_ratio.index, 'submission_ratio': submission_ratio.values})

# Plot bars
fig = px.bar(data_frame=submission_ratio_mdf, x=submission_ratio_mdf['importance'].astype(str),
             y='submission_ratio',labels={'importance': 'Importance',
                                          'submission_ratio': 'Submission Ratio (%)',
            'x':'Match importance (from early rounds up to superfights)'},
             title='Probability of Submission by match importance',
             text=submission_ratio.values.round(2),  # Display percentage values inside bars
             hover_data={'importance': True, 'submission_ratio': ':.2f'},
            color_continuous_scale=px.colors.sequential.Sunsetdark,
            color='importance', hover_name='importance')


# Modify the tick names in the x-axis
custom_labels = {
'0':'Other',
'1':'Quarterfinals',
'2':'Semifinals',
'3':'3rd place',
'4':'Finals',
'5':'Superfight'
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))

fig.update_traces(textposition='inside')  # Position the text inside the bars
fig.update_layout(showlegend=False, coloraxis_showscale=False,
                 title_x=0.48)


fig.show()

The chart above clearly shows that the first round has the highest submission rate.  
The skill variance in this stage is the greatest, since all athletes are pooled together initially, which can explain the higher probability of submissions occurring.  
  
Superfights, on the other hand, have considerably lower submission rates. These matches happen only between the most skilled competitors, who have the most to lose. Not only are more skilled competitors more difficult to submit, they generally tend to play more defensively in these situations, which might explain these lower submission rates.
  
Semifinals display lower-than-expected submission rates, which might indicate athletes are more cautious at this stage, being faced with the prospect of competing in a final or losing the chance to do so.
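
The rate per stage can also be computed as the mean of a boolean column, which avoids dividing two separately grouped tables. The sketch below uses a tiny made-up table (`toy_matches`) with the same column names as the matches dataset. Unlike the division used in the cell above, a stage with no submissions at all comes out as 0 rather than as a missing value.

```python
import pandas as pd

# Tiny made-up sample: stage of the match and how it was won
toy_matches = pd.DataFrame({
    'importance':     [0, 0, 0, 0, 2, 2, 5, 5],
    'victory_method': ['SUBMISSION', 'SUBMISSION', 'POINTS', 'SUBMISSION',
                       'POINTS', 'SUBMISSION', 'DECISION', 'POINTS'],
})

# True where the match ended by submission; the mean of booleans is the proportion of True
is_submission = toy_matches['victory_method'].eq('SUBMISSION')
submission_rate = is_submission.groupby(toy_matches['importance']).mean() * 100

print(submission_rate)
```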

In [40]:
# Get only matches that ended in decision
decision_mdf = mdf[mdf['victory_method'] == 'DECISION']

# Calculate decision ratio for each 'importance' value
decision_ratio = decision_mdf.groupby('importance').size() / mdf.groupby('importance').size()
decision_ratio *= 100  # Convert to percentage

# It's easier to plot this exact information by creating a specific dataframe
decision_ratio_mdf = pd.DataFrame({'importance': decision_ratio.index, 'decision_ratio': decision_ratio.values})

# Plot the bars with plotly express
fig = px.bar(data_frame=decision_ratio_mdf, x=decision_ratio_mdf['importance'].astype(str),
             y='decision_ratio',labels={'importance': 'Importance',
                                          'decision_ratio': 'Decision ratio (%)',
            'x':'Match importance (from early rounds up to superfights)'},
             title='Probability of Decision by match importance',
             text=decision_ratio.values.round(2),  # Display percentage values inside bars
             hover_data={'importance': True, 'decision_ratio': ':.2f'},
            color_continuous_scale=px.colors.sequential.Sunsetdark,
            color='importance')


# Modify the tick names in the x-axis
custom_labels = {
'0':'Other',
'1':'Quarterfinals',
'2':'Semifinals',
'3':'3rd place',
'4':'Finals',
'5':'Superfight'
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))

fig.update_traces(textposition='inside') 
fig.update_layout(showlegend=False, coloraxis_showscale=False,
                 yaxis_title="Probability of victory by decision (%)",
                 title_x=0.48)


fig.show()

The same trends can be seen here, with semifinals and superfights boasting high decision victory rates.  
This means that not only are athletes more cautious about getting submitted, but also about conceding points in these situations.

### Who are the most accomplished submission specialists of all time?

In [41]:
# Get counts of submissions by fighter for each target
winner_counts = mdf.groupby(['submission_target', 'winner_name']).size().reset_index(name='win_count')

# Find the most frequent winner for each submission target
top_winners = winner_counts.groupby('submission_target')['win_count'].idxmax()
most_frequent_winners = winner_counts.loc[top_winners]

# Group the data by submission target and loser name to get the count of losses for each combination
loser_counts = mdf.groupby(['submission_target', 'loser_name']).size().reset_index(name='loss_count')

# Find the most frequent loser for each submission target
top_losers = loser_counts.groupby('submission_target')['loss_count'].idxmax()
most_frequent_losers = loser_counts.loc[top_losers]

# Create a horizontal bar chart using Plotly
fig = go.Figure()

# Add green bars for the most frequent winners
fig.add_trace(go.Bar(
    y=most_frequent_winners['submission_target'],
    x=most_frequent_winners['win_count'],
    orientation='h',
    name='Most Frequent Winner',
    marker=dict(color='seagreen'),
    text=most_frequent_winners['winner_name'],
    textposition='inside', opacity=0.8,
    textfont=dict(color='white', size=14)
))

# Add red bars for the most frequent losers
fig.add_trace(go.Bar(
    y=most_frequent_losers['submission_target'],
    x=most_frequent_losers['loss_count'],
    orientation='h',
    name='Most Frequent Loser',
    marker=dict(color='orangered'),
    text=most_frequent_losers['loser_name'],
    textposition='inside', opacity=0.8,
    textfont=dict(color='white', size=14)
))


fig.update_layout(
    title='Most Frequent Winner and Loser for Each Submission Target',
    xaxis_title='Number of matches',
    yaxis_title='Submission Target',
    barmode='relative',
    bargap=0.2,
    legend=dict(
        x=0.7,
        y=0.95,
        bgcolor='rgba(0,0,0,0)'
    ),
    showlegend=True,
    title_x=0.45,
)

fig.show()

Marcelo Garcia greatly stands out for the number of matches ended by neck attacks. Not only is he renowned for his use of guillotines, his seated guard systems for submission grappling are highly influential to this day.  

Dean Lister's known for introducing many of the leg attacks that later got popularized by teams focusing on the area, and has achieved many submissions with these attacks.  
  
Comparatively, arm attacks are rarer as a choice for specialization, and their submissions seem to be more evenly distributed among the athletes.

### Who won the most matches by each victory method?

In [42]:
# Exclude minor exceptions from the analysis
victories_mdf = mdf[(mdf['victory_method'] != 'INJURY') & (mdf['victory_method'] != 'DESQUALIFICATION')].copy()

# Group the data by victory method and winner name to get the count of wins for each combination
winner_counts = victories_mdf.groupby(['victory_method', 'winner_name']).size().reset_index(name='win_count')

# Find the most frequent winner for each victory method
top_winners = winner_counts.groupby('victory_method')['win_count'].idxmax()
most_frequent_winners = winner_counts.loc[top_winners]

# Group the data by victory method and loser name to get the count of losses for each combination
loser_counts = victories_mdf.groupby(['victory_method', 'loser_name']).size().reset_index(name='loss_count')

# Find the most frequent loser for each victory method
top_losers = loser_counts.groupby('victory_method')['loss_count'].idxmax()
most_frequent_losers = loser_counts.loc[top_losers]

# Create a horizontal bar chart using Plotly
fig = go.Figure()

# Add green bars for the most frequent winners
fig.add_trace(go.Bar(
    y=most_frequent_winners['victory_method'],
    x=most_frequent_winners['win_count'],
    orientation='h',
    name='Most Frequent Winner',
    marker=dict(color='seagreen'),
    text=most_frequent_winners['winner_name'],
    textposition='inside', opacity=0.8,
    textfont=dict(color='white', size=16)
))

# Add red bars for the most frequent losers
fig.add_trace(go.Bar(
    y=most_frequent_losers['victory_method'],
    x=most_frequent_losers['loss_count'],
    orientation='h',
    name='Most Frequent Loser',
    marker=dict(color='orangered'),
    text=most_frequent_losers['loser_name'],
    textposition='inside', opacity=0.8,
    textfont=dict(color='white', size=16)
))

# Set the layout
fig.update_layout(
    title='Most frequent Winner and Loser of each Victory method',
    xaxis_title='Number of matches',
    yaxis_title='Victory by',
    barmode='relative',
    bargap=0.2,
    legend=dict(
        x=0.7,
        y=0.02,
        bgcolor='rgba(0,0,0,0)'
    ),
    showlegend=True,
    title_x=0.47,
)


fig.show()

Again, Marcelo Garcia's career stands out for his ability to submit opponents in high-level competition.  
  
Andre Galvao, on the other hand, seems like a more conservative player, knowing how to use the points system to win matches and achieve competition success.

### Does weight influence which submissions are executed?

In [43]:

# Group the data by 'weight_class' and 'submission_target' and calculate the relative frequency
grouped_data = mdf.groupby(['weight_class', 'submission_target']).size().reset_index(name='count')
grouped_data['relative_frequency'] = grouped_data.groupby('weight_class')['count'].transform(lambda x: x / x.sum())

# Create the bar chart using Plotly Express
fig = px.bar(grouped_data, x='weight_class', y='relative_frequency',
             color='submission_target')

# Calculate the y-coordinate (middle of each stacked segment) for each annotation
grouped_data['cumulative_relative_frequency'] = grouped_data.groupby('weight_class')['relative_frequency'].cumsum() - 0.5 * grouped_data['relative_frequency']
grouped_data['x_annotation'] = grouped_data['weight_class']

# Add annotations to display the submission target names
for _, row in grouped_data.iterrows():
    fig.add_annotation(
        x=row['x_annotation'],
        y=row['cumulative_relative_frequency'],
        text=row['submission_target'],
        showarrow=False,
        font=dict(color='white', size=12),
        textangle=0,
        xanchor='center',
        yanchor='middle'
    )

fig.update_traces(hovertemplate='%{x} Weight Class: %{y} submission ratio')

fig.update_layout(
    title='Target body part of successful submissions',
    xaxis_title='Weight class',
    yaxis_title='Ratio to total submissions in that weight class',
    barmode='relative',
    bargap=0.2,
    showlegend=False,
    title_x=0.5,
)

# Modify the tick names in the x-axis
custom_labels = {
'0':'66kg (60kg for females)',
'1':'77kg',
'2':'88kg',
'3':'99kg',
'4':'+99kg (+60kg for females)',
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))


fig.show()

As shown in the chart above, neck submissions are less common in matches between heavier athletes.    

Since these submissions are more likely to happen from dominant positions while leg and arm attacks are available from a wider range of situations, one can assume this trend is due to the fact that stronger, larger fighters are less likely to be controlled.  
  
  
Legs and arms are more exposed than the neck in the majority of positions, so athletes can target them without first having to secure a more dominant position.

### Are heavier athletes less likely to submit each other in general?

In [44]:
# Calculate the probability of 'SUBMISSION' for each weight class
total_counts = mdf.groupby('weight_class').size().reset_index(name='total_count')
submission_counts = mdf[mdf['victory_method'] == 'SUBMISSION'].groupby('weight_class').size().reset_index(name='submission_count')
probability_data = pd.merge(total_counts, submission_counts, on='weight_class', how='left')
probability_data['probability'] = (probability_data['submission_count'] / probability_data['total_count']) * 100

# Create the bar chart using Plotly Express
fig = px.bar(probability_data, x='weight_class', y='probability',
             color='weight_class', color_continuous_scale=px.colors.sequential.Sunsetdark)

# Add static text annotations of probability values inside each bar
for i, row in probability_data.iterrows():
    fig.add_annotation(
        x=row['weight_class'],
        y=row['probability']-1.2,
        text=f'<b>{np.round(row["probability"], 2)}%</b>',
        showarrow=False,
        font=dict(color='lightsteelblue', size=14),
        textangle=0,
        xanchor='center',
        yanchor='middle'
    )

fig.update_traces(hovertemplate='%{x} Weight Class: %{y:.2f}% submission probability')
fig.update_layout(
    title='Probability of victory by any submission in different weight classes',
    xaxis_title='Weight class',
    yaxis_title='Submission probability (%)',
)

# Modify the tick names in the x-axis
custom_labels = {
'0':'66kg (60kg for females)',
'1':'77kg',
'2':'88kg',
'3':'99kg',
'4':'+99kg (+60kg for females)',
}

fig.update_xaxes(ticktext=list(custom_labels.values()),
                 tickvals=list(custom_labels.keys()))
fig.update_layout(showlegend=False, coloraxis_showscale=False,title_x=0.5)


fig.show()

The graph above shows no clear trend in how submission rates vary with weight class.
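
A visual impression like this one could be backed up with a chi-square test of independence between weight class and match outcome. The sketch below is only an outline of that idea on hypothetical counts (the `observed` table is made up, not taken from the matches dataset). It computes the statistic by hand and compares it with 9.488, the standard 5% critical value for 4 degrees of freedom.

```python
import numpy as np

# Hypothetical counts per weight class (columns):
# row 0 = matches won by submission, row 1 = matches won by any other method
observed = np.array([
    [30, 28, 33, 29, 31],
    [70, 72, 67, 71, 69],
])

# Expected counts if the submission rate were identical in every weight class
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic and degrees of freedom of the contingency table
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# 5% critical value of the chi-square distribution with 4 degrees of freedom
CRITICAL_5PCT_DOF4 = 9.488
significant = chi2_stat > CRITICAL_5PCT_DOF4

print(round(chi2_stat, 3), dof, significant)
```

For these made-up counts the statistic stays far below the critical value, which would be consistent with submission rates that do not depend on weight class.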

### How are open weight matches (absolute division) different than regular ones?

In [45]:
# Group the data by 'absolute' and 'submission_target' and calculate the relative frequency
grouped_data = mdf.groupby(['absolute', 'submission_target']).size().reset_index(name='count')
grouped_data['relative_frequency'] = grouped_data.groupby('absolute')['count'].transform(lambda x: x / x.sum())

# Create the bar chart using Plotly Express
fig = px.bar(grouped_data, x='absolute', y='relative_frequency',
             color='submission_target')

# Calculate the y-coordinate (middle of each stacked segment) for each annotation
grouped_data['cumulative_relative_frequency'] = grouped_data.groupby('absolute')['relative_frequency'].cumsum() - 0.5 * grouped_data['relative_frequency']
grouped_data['x_annotation'] = grouped_data['absolute']

# Add annotations to display the submission target names
for _, row in grouped_data.iterrows():
    fig.add_annotation(
        x=row['x_annotation'],
        y=row['cumulative_relative_frequency'],
        text=row['submission_target'],
        showarrow=False,
        font=dict(color='white', size=12),
        textangle=0,
        xanchor='center',
        yanchor='middle'
    )


fig.update_layout(
    title='Submission target',
    xaxis_title='',
    yaxis_title='Ratio to total submissions in match type',
    barmode='relative',
    bargap=0.2,
    showlegend=False,
)

fig.update_layout(
    title_x=0.5,
    xaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1],
        ticktext = ['Weight class divisions', 'Open weight (absolute) division']
    )
)



fig.show()

It can be seen above that arm and neck attacks are less common in matches in the absolute division.  

This most likely reflects a trend of more frequent leg exchanges and entanglements in these matches, which are commonly pursued as an equalizing tactic by smaller fighters against larger ones.

In [46]:
# Calculate likelihood for each victory method by division type and convert to percentages
likelihood_mdf = mdf.groupby(['absolute', 'victory_method']).size().div(mdf.groupby('absolute').size(), level='absolute').reset_index(name='likelihood')
likelihood_mdf['likelihood'] *= 100

# Create stacked bar chart using plotly express
fig = px.bar(likelihood_mdf, x='absolute', y='likelihood', color='victory_method',
            labels={'likelihood': 'Probability (%)', 'victory_method': 'Victory by'},
            title='Victory types on Open weight Vs. Weight class divisions',
            hover_data={'absolute': True, 'likelihood': ':.2f', 'victory_method': True},
            barmode='stack',
            color_discrete_sequence=px.colors.qualitative.T10
            )

fig.update_layout(
    title_x=0.45,
    xaxis_title='',
    xaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1],
        ticktext = ['Weight class divisions', 'Open weight (absolute) division']
    )
)

fig.show()

In terms of victory method, on the other hand, the absolute division shows results very similar to those of the weight class divisions.  
One small but noticeable difference is in the probabilities of injury and disqualification, both of which are more likely to happen in open weight matches.

## Thank you for visiting this notebook!<a id="section7"></a>

Help spread information by sharing it with anyone you know who might be interested in either submission grappling or data in general.

If I missed anything, please let me know!  

Also, if there's any specific visualization you'd like to see used with this data or if you have a different specific tool or library to suggest, feel free to leave a comment :)