<a href="https://colab.research.google.com/github/anihab/dnaTokenization/blob/main/results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import json
import pandas as pd
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
# define color scheme
colors = ['#535C7B', '#876B92', '#d1e6e5', '#BF7A97', '#EA918E', '#59adf6', '#FFB480', '#c780e8']
model_color_map = {'dnabert1_6': '#535C7B', 'dnabert2': '#876B92', 'nt_NT_500_1000g': '#FFB480', 'hyenadna_1k': '#EA918E'}

##**Load and Process Data**

In addition to full results (`df`), create dataframes for:
*   Mean metrics across replications, with standard deviation (`mean_acc`, `mean_f1`, `mean_mcc`)
*   Self-reported results merged with our averaged results and with results reported by other teams (`merged_df`)



In [4]:
# get our results
df = pd.read_csv('/content/drive/MyDrive/tokenization/data/finetune/results.csv')
df.head()

Unnamed: 0,model,task,task_category,task_benchmark,replicate_number,accuracy,f1,matthews_correlation,epoch
0,dnabert1_6,H3,epigenetic mark prediction (yeast),GUE,1,0.863059,0.862989,0.726318,3.0
1,dnabert1_6,H3K14ac,epigenetic mark prediction (yeast),GUE,1,0.717398,0.710967,0.421938,3.0
2,dnabert1_6,H3K36me3,epigenetic mark prediction (yeast),GUE,1,0.743979,0.736517,0.47768,3.0
3,dnabert1_6,H3K4me1,epigenetic mark prediction (yeast),GUE,1,0.714962,0.710911,0.422638,3.0
4,dnabert1_6,H3K4me2,epigenetic mark prediction (yeast),GUE,1,0.669599,0.648195,0.304856,3.0


In [12]:
# group by model and task, calculate mean statistics and std
mean_acc = df.groupby(['model', 'task', 'task_category', 'task_benchmark']).agg({'accuracy': ['mean', 'std']})
mean_acc.columns = ['mean', 'std']
mean_acc['mean'] = mean_acc['mean'].round(2)
mean_acc.reset_index(inplace=True)

mean_f1 = df.groupby(['model', 'task', 'task_category', 'task_benchmark']).agg({'f1': ['mean', 'std']})
mean_f1.columns = ['mean', 'std']
mean_f1['mean'] = mean_f1['mean'].round(2)
mean_f1.reset_index(inplace=True)

mean_mcc = df.groupby(['model', 'task', 'task_category', 'task_benchmark']).agg({'matthews_correlation': ['mean', 'std']})
mean_mcc.columns = ['mean', 'std']
mean_mcc['mean'] = mean_mcc['mean'].round(2)
mean_mcc.reset_index(inplace=True)
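
The three summary blocks above follow one pattern, so they could be folded into a single helper. Below is a minimal sketch on synthetic data; the function name `summarize_metric` and its `keep_missing_keys` flag are hypothetical and not part of this notebook. One point worth knowing: `groupby` drops rows whose group key is missing by default, so rows with an empty `task_category` (as in the Genomic Benchmark rows) would not appear in the summaries unless `dropna=False` is passed.

```python
import pandas as pd

GROUP_COLS = ['model', 'task', 'task_category', 'task_benchmark']

def summarize_metric(results, metric, keep_missing_keys=False):
    """Hypothetical helper: per-group mean (rounded to 2 dp) and sample std of one metric."""
    summary = (results.groupby(GROUP_COLS, dropna=not keep_missing_keys)[metric]
                      .agg(['mean', 'std'])
                      .reset_index())
    summary['mean'] = summary['mean'].round(2)
    return summary

# synthetic stand-in for the real results table (values are made up)
toy = pd.DataFrame({
    'model': ['m1'] * 5,
    'task': ['H3', 'H3', 'H3', 'human_or_worm', 'human_or_worm'],
    'task_category': ['epigenetic', 'epigenetic', 'epigenetic', None, None],
    'task_benchmark': ['GUE', 'GUE', 'GUE', 'Genomic Benchmark', 'Genomic Benchmark'],
    'accuracy': [0.80, 0.82, 0.84, 0.95, 0.97],
})

default_summary = summarize_metric(toy, 'accuracy')                       # missing-category rows dropped
full_summary = summarize_metric(toy, 'accuracy', keep_missing_keys=True)  # missing-category rows kept
print(default_summary)
print(full_summary)
```

Whether the missing-category rows should be kept is a judgment call for the analysis; the flag only makes the choice explicit.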

In [8]:
# get all average metrics across number of replications
average_metrics = df.groupby(['model', 'task', 'task_category', 'task_benchmark']).agg({
    'accuracy': 'mean',
    'f1': 'mean',
    'matthews_correlation': 'mean'
}).reset_index()

# get reported results; reported MCC appears to be stored in percent, so rescale to match ours
reported = pd.read_csv('/content/drive/MyDrive/tokenization/data/finetune/reported.csv')
reported['matthews_correlation'] = reported['matthews_correlation'] / 100

# merge self-reported results with our averaged results and results reported by other teams
# (assign returns a copy, so average_metrics itself is left unchanged)
ours = average_metrics.assign(reported_by='us')
merged_df = pd.merge(reported, ours, how='outer')
merged_df.head()

Unnamed: 0,model,reported_by,task,task_category,task_benchmark,accuracy,f1,matthews_correlation
0,dnabert1_6,dnabert,H3,epigenetic mark prediction (yeast),GUE,,,0.731
1,dnabert1_6,dnabert,H3K14ac,epigenetic mark prediction (yeast),GUE,,,0.4006
2,dnabert1_6,dnabert,H3K36me3,epigenetic mark prediction (yeast),GUE,,,0.4725
3,dnabert1_6,dnabert,H3K4me1,epigenetic mark prediction (yeast),GUE,,,0.4144
4,dnabert1_6,dnabert,H3K4me2,epigenetic mark prediction (yeast),GUE,,,0.3237
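
When `pd.merge` is called without `on`, it joins on every column the two frames share. Assuming both frames have the same columns (as the output above suggests), rows from different reporters never match, so the outer merge acts as a row-wise union. A small sketch with made-up values illustrates this; the `*_toy` names are hypothetical.

```python
import pandas as pd

# made-up reported value, stored in percent like the reported CSV appears to be
reported_toy = pd.DataFrame({
    'model': ['dnabert1_6'],
    'reported_by': ['dnabert'],
    'task': ['H3'],
    'matthews_correlation': [73.0],
})
reported_toy['matthews_correlation'] = reported_toy['matthews_correlation'] / 100

# made-up value standing in for our averaged result
ours_toy = pd.DataFrame({
    'model': ['dnabert1_6'],
    'reported_by': ['us'],
    'task': ['H3'],
    'matthews_correlation': [0.72],
})

# no `on` given: join keys are all shared columns, so nothing matches and both rows survive
merged_toy = pd.merge(reported_toy, ours_toy, how='outer')
print(merged_toy)
```

If the two frames ever carry different column sets, passing `on=[...]` explicitly (or using `pd.concat`) would make the intent clearer.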


##**Full Results Table**

*   For inspecting the full results (`df`)
*   Columns are: Model, Task, Task Category, Task Benchmark, Replicate Number, Accuracy, F1, Matthews Correlation, Epoch



In [13]:
df

Unnamed: 0,model,task,task_category,task_benchmark,replicate_number,accuracy,f1,matthews_correlation,epoch
0,dnabert1_6,H3,epigenetic mark prediction (yeast),GUE,1,0.863059,0.862989,0.726318,3.0
1,dnabert1_6,H3K14ac,epigenetic mark prediction (yeast),GUE,1,0.717398,0.710967,0.421938,3.0
2,dnabert1_6,H3K36me3,epigenetic mark prediction (yeast),GUE,1,0.743979,0.736517,0.477680,3.0
3,dnabert1_6,H3K4me1,epigenetic mark prediction (yeast),GUE,1,0.714962,0.710911,0.422638,3.0
4,dnabert1_6,H3K4me2,epigenetic mark prediction (yeast),GUE,1,0.669599,0.648195,0.304856,3.0
...,...,...,...,...,...,...,...,...,...
1299,nt_NT_500_1000g,human_enhancers_ensembl,,Genomic Benchmark,3,0.869026,0.868921,0.739777,4.0
1300,nt_NT_500_1000g,human_ensembl_regulatory,,Genomic Benchmark,3,0.945824,0.938274,0.876576,4.0
1301,nt_NT_500_1000g,human_nontata_promoters,,Genomic Benchmark,3,0.882646,0.881712,0.763517,4.0
1302,nt_NT_500_1000g,human_ocr_ensembl,,Genomic Benchmark,3,0.749199,0.747463,0.507695,4.0


Just Genomic Benchmark:

In [11]:
df[df['task_benchmark'] == 'Genomic Benchmark']

Unnamed: 0,model,task,task_category,task_benchmark,replicate_number,accuracy,f1,matthews_correlation,epoch
1235,hyenadna_1k,coding_vs_intergenomic,,Genomic Benchmark,1,0.899200,0.899198,0.798393,10.00
1236,hyenadna_1k,human_or_worm,,Genomic Benchmark,1,0.955500,0.955501,0.911006,10.00
1237,hyenadna_1k,human_enhancers_cohn,,Genomic Benchmark,1,0.672904,0.672870,0.345600,9.97
1238,hyenadna_1k,human_enhancers_ensembl,,Genomic Benchmark,1,0.870511,0.870489,0.741435,10.00
1239,hyenadna_1k,human_ensembl_regulatory,,Genomic Benchmark,1,0.928700,0.928230,0.836481,10.00
...,...,...,...,...,...,...,...,...,...
1299,nt_NT_500_1000g,human_enhancers_ensembl,,Genomic Benchmark,3,0.869026,0.868921,0.739777,4.00
1300,nt_NT_500_1000g,human_ensembl_regulatory,,Genomic Benchmark,3,0.945824,0.938274,0.876576,4.00
1301,nt_NT_500_1000g,human_nontata_promoters,,Genomic Benchmark,3,0.882646,0.881712,0.763517,4.00
1302,nt_NT_500_1000g,human_ocr_ensembl,,Genomic Benchmark,3,0.749199,0.747463,0.507695,4.00


##**Bar Charts**

1.   Mean MCC results by task for each model, with SD bars
2.   Mean Accuracy results by task for each model, with SD bars
3.   Mean F1 results by task for each model, with SD bars

In [None]:
task = 'H3'
benchmark = 'GUE'

bar = px.bar(mean_mcc[(mean_mcc['task'] == task) & (mean_mcc['task_benchmark'] == benchmark)],
             x='mean', y='model', color='model',
             color_discrete_map=model_color_map,
             text_auto=True,
             error_x='std',  # bars are horizontal, so the SD applies along x only
             title=task)

bar.update_layout(title_x=0.5,
                  showlegend=False,
                  xaxis=dict(showline=True, linecolor='black', title=''),
                  yaxis=dict(showline=True, linecolor='black', title=''),
                  )
bar.update_traces(insidetextanchor='start', textfont_color='white', textfont_size=14)

bar.update_layout(bargap=0.1)
bar.update_xaxes(range = [0,1])
bar.show()

In [None]:
task = 'H3'
benchmark = 'GUE'

bar = px.bar(mean_acc[(mean_acc['task'] == task) & (mean_acc['task_benchmark'] == benchmark)],
             x='mean', y='model', color='model',
             color_discrete_map=model_color_map,
             text_auto=True,
             error_x='std',  # bars are horizontal, so the SD applies along x only
             title=task)

bar.update_layout(title_x=0.5,
                  showlegend=False,
                  xaxis=dict(showline=True, linecolor='black', title=''),
                  yaxis=dict(showline=True, linecolor='black', title=''),
                  )
bar.update_traces(insidetextanchor='start', textfont_color='white', textfont_size=14)

bar.update_layout(bargap=0.1)
bar.update_xaxes(range = [0,1])
bar.show()

In [None]:
task = 'H3'
benchmark = 'GUE'

bar = px.bar(mean_f1[(mean_f1['task'] == task) & (mean_f1['task_benchmark'] == benchmark)],
             x='mean', y='model', color='model',
             color_discrete_map=model_color_map,
             text_auto=True,
             error_x='std',  # bars are horizontal, so the SD applies along x only
             title=task)

bar.update_layout(title_x=0.5,
                  showlegend=False,
                  xaxis=dict(showline=True, linecolor='black', title=''),
                  yaxis=dict(showline=True, linecolor='black', title=''),
                  )
bar.update_traces(insidetextanchor='start', textfont_color='white', textfont_size=14)

bar.update_layout(bargap=0.1)
bar.update_xaxes(range = [0,1])
bar.show()

##**Full Bar Chart**

*  Mean results by task category for each model, with SD bars
*  Mean results by task for each model, with SD bars

In [None]:
# group by model and task category, calculate mean stats and std
avg_category_acc = df.groupby(['model', 'task_category']).agg({'accuracy': ['mean', 'std']})
avg_category_acc.columns = ['mean', 'std']
avg_category_acc['mean'] = avg_category_acc['mean'].round(2)
avg_category_acc.reset_index(inplace=True)

In [None]:
fig = px.bar(avg_category_acc, x='task_category', y='mean',
             color='model', color_discrete_map=model_color_map,
             barmode='group',
             text_auto=True,
             error_y='std',  # bars are vertical, so the SD applies along y only
             title='Average Accuracy by Task Category and Model')
fig.update_traces(insidetextanchor='start')
fig.show()

In [None]:
fig = px.bar(mean_acc, x='task', y='mean',
             color='model', color_discrete_map=model_color_map,
             barmode='group',
             text_auto=True,
             error_y='std',  # bars are vertical, so the SD applies along y only
             title='Average Accuracy by Task and Model')
fig.update_traces(insidetextanchor='start')
fig.show()

##**Scatter Plots on Full Results**

####Single Model

In [None]:
fig = px.scatter(df[df['model'] == 'dnabert1_6'],
                 x='matthews_correlation',
                 y='replicate_number',
                 color='task',
                 color_discrete_sequence=colors,
                 title='DNABERT1 MCC Results by Replication Number and Task (kmer=6)')

fig.update_traces(marker={'size': 15})
fig.show()

fig = px.scatter(df[df['model'] == 'dnabert1_6'],
                 x='matthews_correlation',
                 y='replicate_number',
                 color='task_category',
                 color_discrete_sequence=colors,
                 hover_data=['task'],
                 title='DNABERT1 MCC Results by Replication Number and Task Category (kmer=6)')
fig.update_traces(marker={'size': 15})
fig.show()

####All Models

Show all model summaries side-by-side

* x-axis - matthews correlation coefficient
* y-axis - replication number
* color the dots by task category


In [None]:
color_mapping = {
    'epigenetic mark prediction (yeast)': '#636efa',
    'transcription factor prediction (human)': '#EF553B',
    'transcription factor prediction (mouse)': '#00cc96',
    'promoter detection (human)': '#ab63fa',
    'core promoter detection (human)': '#FFA15A',
    'covid variant classification (virus)': '#19d3f3',
    'splice site detection (human)': '#FF6692',
    'enhancers (human)': 'pink'
}

fig = sp.make_subplots(rows=1, cols=3, subplot_titles=('DNABERT-1', 'DNABERT-2', 'NT-1000G (500M)'))

models = ['dnabert1_6', 'dnabert2', 'nt_NT_500_1000g']
seen_categories = set()
for i, model in enumerate(models, start=1):
    filtered_df = df[df['model'] == model]
    # skip missing categories: NaN never matches the equality filter below
    for task_category in filtered_df['task_category'].dropna().unique():
        scatter = px.scatter(filtered_df[filtered_df['task_category'] == task_category],
                             x='matthews_correlation',
                             y='replicate_number',
                             color='task_category',
                             color_discrete_map=color_mapping,
                             hover_data=['task'])
        trace = scatter.data[0]
        # list each category only once in the shared legend
        trace.showlegend = task_category not in seen_categories
        seen_categories.add(task_category)
        fig.add_trace(trace, row=1, col=i)
fig.update_layout(legend=dict(traceorder='normal'))
fig.show()

##**Scatter Plot: Matthews Correlation Self Reported vs. Experimental Results**

*   x-axis - self reported
*   y-axis - measured by us or other publications
*   color the dots by category
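
The alignment behind these plots pairs each self-reported value with every other measurement of the same task. A minimal sketch on made-up rows follows; the `delta` column is an addition for illustration and is not computed in this notebook.

```python
import pandas as pd

# made-up rows in the shape of merged_df (values are illustrative only)
merged_toy = pd.DataFrame({
    'model': ['dnabert1_6'] * 4,
    'reported_by': ['dnabert', 'us', 'dnabert', 'NT'],
    'task': ['H3', 'H3', 'H3K14ac', 'H3K14ac'],
    'task_category': ['epigenetic mark prediction (yeast)'] * 4,
    'matthews_correlation': [0.73, 0.72, 0.40, 0.42],
})

cols = ['reported_by', 'task', 'task_category', 'matthews_correlation']
self_toy = merged_toy.loc[merged_toy['reported_by'] == 'dnabert', cols]
other_toy = merged_toy.loc[merged_toy['reported_by'] != 'dnabert', cols]

# inner join on task: one row per (self-reported, other) pair
aligned_toy = pd.merge(self_toy, other_toy,
                       on=['task', 'task_category'],
                       suffixes=('_self', '_other'))

# positive delta: the other measurement is higher than the self-reported one
aligned_toy['delta'] = (aligned_toy['matthews_correlation_other']
                        - aligned_toy['matthews_correlation_self'])
print(aligned_toy[['task', 'reported_by_other', 'delta']])
```

Points on the diagonal of the scatter plot correspond to a `delta` of zero, so this column gives a numeric reading of how far each point sits from agreement.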

In [None]:
# filter data to single model
filtered_df = merged_df[merged_df['model'] == 'dnabert1_6']
self_reported = filtered_df[filtered_df['reported_by'] == 'dnabert']
other_reported = filtered_df[filtered_df['reported_by'] != 'dnabert']

# merge 'self_reported' and 'other_reported' to align the rows
aligned_df = pd.merge(self_reported[['reported_by', 'task', 'task_category', 'matthews_correlation']],
                      other_reported[['reported_by', 'task', 'task_category', 'matthews_correlation']],
                      on=['task', 'task_category'],
                      suffixes=('_self', '_other'))

# opacity by who reported the comparison value (1.0 for reporters not listed)
opacity_mapping = {'NT': 0.5, 'us': 1.0}

# create a scatter plot
fig = px.scatter(aligned_df,
                 x='matthews_correlation_self',
                 y='matthews_correlation_other',
                 color='task_category',
                 color_discrete_sequence=colors,
                 hover_data=['task', 'reported_by_other'],
                 title='DNABERT1 MCC Comparison')
fig.update_traces(marker={'size': 15})
# px.scatter takes a single opacity value, so set per-point opacity trace by trace;
# customdata column 1 holds reported_by_other (the second hover_data entry)
fig.for_each_trace(lambda t: t.update(
    marker_opacity=[opacity_mapping.get(r, 1.0) for r in t.customdata[:, 1]]))
fig.show()

In [None]:
# filter data to single model
filtered_df = merged_df[merged_df['model'] == 'dnabert2']
self_reported = filtered_df[filtered_df['reported_by'] == 'dnabert']
other_reported = filtered_df[filtered_df['reported_by'] != 'dnabert']

# merge 'self_reported' and 'other_reported' to align the rows
aligned_df = pd.merge(self_reported[['reported_by', 'task', 'task_category', 'matthews_correlation']],
                      other_reported[['reported_by', 'task', 'task_category', 'matthews_correlation']],
                      on=['task', 'task_category'],
                      suffixes=('_self', '_other'))

# opacity by who reported the comparison value (1.0 for reporters not listed)
opacity_mapping = {'NT': 0.5, 'us': 1.0}

# create a scatter plot
fig = px.scatter(aligned_df,
                 x='matthews_correlation_self',
                 y='matthews_correlation_other',
                 color='task_category',
                 color_discrete_sequence=colors,
                 hover_data=['task', 'reported_by_other'],
                 title='DNABERT2 MCC Comparison')
fig.update_traces(marker={'size': 15})
# px.scatter takes a single opacity value, so set per-point opacity trace by trace;
# customdata column 1 holds reported_by_other (the second hover_data entry)
fig.for_each_trace(lambda t: t.update(
    marker_opacity=[opacity_mapping.get(r, 1.0) for r in t.customdata[:, 1]]))
fig.show()

In [None]:
# filter data to single model
filtered_df = merged_df[merged_df['model'] == 'nt_NT_500_1000g']
self_reported = filtered_df[filtered_df['reported_by'] == 'NT']
other_reported = filtered_df[filtered_df['reported_by'] != 'NT']

# merge 'self_reported' and 'other_reported' to align the rows
aligned_df = pd.merge(self_reported[['reported_by', 'task', 'task_category', 'matthews_correlation']],
                      other_reported[['reported_by', 'task', 'task_category', 'matthews_correlation']],
                      on=['task', 'task_category'],
                      suffixes=('_self', '_other'))

# opacity by who reported the comparison value (1.0 for reporters not listed)
opacity_mapping = {'dnabert': 0.5, 'us': 1.0}

# create a scatter plot
fig = px.scatter(aligned_df,
                 x='matthews_correlation_self',
                 y='matthews_correlation_other',
                 color='task_category',
                 color_discrete_sequence=colors,
                 hover_data=['task', 'reported_by_other'],
                 title='Nucleotide Transformer MCC Comparison')
fig.update_traces(marker={'size': 15})
# px.scatter takes a single opacity value, so set per-point opacity trace by trace;
# customdata column 1 holds reported_by_other (the second hover_data entry)
fig.for_each_trace(lambda t: t.update(
    marker_opacity=[opacity_mapping.get(r, 1.0) for r in t.customdata[:, 1]]))
fig.show()

##**Violin Plots**

*  Show results for tasks whose MCC standard deviation across replicates was large (≥ 0.03), i.e., more variability than we expected.

In [None]:
mean_mcc[(mean_mcc['std'] >= 0.03)]

Unnamed: 0,model,task,task_category,task_benchmark,mean,std
15,dnabert1_6,mouse_3,transcription factor prediction (mouse),GUE,0.57,0.100708
19,dnabert1_6,prom_300_tata,promoter detection (human),GUE,0.66,0.058395
33,dnabert2,H3K14ac,epigenetic mark prediction (yeast),GUE,0.5,0.036057
34,dnabert2,H3K36me3,epigenetic mark prediction (yeast),GUE,0.55,0.034612
36,dnabert2,H3K4me2,epigenetic mark prediction (yeast),GUE,0.31,0.031763
37,dnabert2,H3K4me3,epigenetic mark prediction (yeast),GUE,0.34,0.061783
41,dnabert2,H4ac,epigenetic mark prediction (yeast),GUE,0.44,0.03992
42,dnabert2,covid,covid variant classification (virus),GUE,0.03,0.035098
48,dnabert2,mouse_3,transcription factor prediction (mouse),GUE,0.75,0.05142
52,dnabert2,prom_300_tata,promoter detection (human),GUE,0.63,0.039657
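
The selection above can also be computed directly from the per-replicate results. The sketch below uses made-up values and a hypothetical constant `STD_THRESHOLD` set to the 0.03 cut-off used in the filter; it computes the sample standard deviation per model and task and keeps the groups at or above the threshold.

```python
import pandas as pd

# made-up replicate results: one stable task and one noisy task
toy = pd.DataFrame({
    'model': ['m1'] * 6,
    'task': ['stable'] * 3 + ['noisy'] * 3,
    'matthews_correlation': [0.70, 0.71, 0.72, 0.40, 0.50, 0.60],
})

STD_THRESHOLD = 0.03  # assumed cut-off, matching the filter used in this section

# sample standard deviation of MCC across replicates, per (model, task)
spread = (toy.groupby(['model', 'task'])['matthews_correlation']
             .std()
             .reset_index(name='std'))

high_variance = spread[spread['std'] >= STD_THRESHOLD]
print(high_variance)
```

Only the `noisy` task passes the filter here, which mirrors how the tasks listed above were chosen for the violin plots.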


In [None]:
fig = px.violin(df[(df['model'] == 'dnabert1_6') & (df['task'] == 'mouse_3')],
                x='matthews_correlation',
                points='all',
                box=True,
                color='model',
                color_discrete_map=model_color_map,
                title='DNABERT-1 mouse_3 Task Results')
fig.update_traces(pointpos=0)
fig.show()

In [None]:
fig = px.violin(df[(df['model'] == 'dnabert2') & (df['task'] == 'H3K4me3')],
                x='matthews_correlation',
                points='all',
                box=True,
                color='model',
                color_discrete_map=model_color_map,
                title='DNABERT2 H3K4me3 Task Results')
fig.update_traces(pointpos=0)
fig.show()