# 1. BLOCKER and CRITICAL Maven issues analysis

## Loading data

Query used to fetch the BLOCKER and CRITICAL issues, along with the connection string for the (Postgres) database:

In [4]:
query_issues = '''
    select
        i.kee as uuid,
        i.severity,
        i.message as message,
        i.line as line,
        p.name as file_name,
        m.name as metric,
        l.value as value
    from
        issues i 
        inner join projects p on i.component_uuid = p.uuid
        inner join live_measures l on i.component_uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        i.project_uuid = 'AWrwqThXS_LSSKKohbva'
        -- technical debt of severity BLOCKER or CRITICAL
        and i.severity in ('BLOCKER', 'CRITICAL')
        and l.metric_id in (3, 18, 47) -- metrics to extract for the file in question'''

connection_url = 'postgresql://sonar:sonar@localhost/sonar'

Importing analysis libraries:

In [5]:
import pandas as pd
import numpy as np
import scipy.stats as ss

Loading the results into a DataFrame:

In [6]:
df_issues = pd.read_sql(query_issues, connection_url)
df_issues.head()

Unnamed: 0,uuid,severity,message,line,file_name,metric,value
0,AWrwqV4pm0KLequXiitL,CRITICAL,Make sure this file handling is safe here.,35,RuntimeInfo.java,ncloc,24.0
1,AWrwqV4pm0KLequXiitL,CRITICAL,Make sure this file handling is safe here.,35,RuntimeInfo.java,complexity,3.0
2,AWrwqV4pm0KLequXiitK,CRITICAL,Rename this constant name to match the regular...,32,RuntimeInfo.java,ncloc,24.0
3,AWrwqV4pm0KLequXiitK,CRITICAL,Rename this constant name to match the regular...,32,RuntimeInfo.java,complexity,3.0
4,AWrwqV4Mm0KLequXiis-,CRITICAL,"Make ""pluginDescriptor"" transient or serializa...",36,MojoNotFoundException.java,ncloc,52.0
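
Because the query joins each issue with `live_measures`, every issue appears once per metric (note the repeated `uuid` in rows 0 and 1 above). A minimal sketch of reshaping that long layout into one row per issue follows. It uses a small hand-made frame whose values are copied from the preview; the names `toy` and `wide` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame mirroring the first rows of df_issues
# (long format: one row per issue and metric).
toy = pd.DataFrame({
    'uuid': ['AWrwqV4pm0KLequXiitL', 'AWrwqV4pm0KLequXiitL',
             'AWrwqV4pm0KLequXiitK', 'AWrwqV4pm0KLequXiitK'],
    'file_name': ['RuntimeInfo.java'] * 4,
    'metric': ['ncloc', 'complexity', 'ncloc', 'complexity'],
    'value': [24.0, 3.0, 24.0, 3.0],
})

# One row per issue, one column per metric.
wide = toy.pivot(index='uuid', columns='metric', values='value')
print(wide)
```

The wide layout makes it harder to count the same issue twice by accident in later steps.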


## Insights

Size of the DataFrame

In [7]:
df_issues.shape

(1133, 7)

Issues count per severity

In [38]:
df_issues.drop_duplicates('uuid').groupby('severity').count().uuid

severity
BLOCKER      15
CRITICAL    518
Name: uuid, dtype: int64
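
The `drop_duplicates('uuid')` step matters because each issue is repeated once per metric. An equivalent way to count distinct issues, sketched here on made-up data, is `nunique`; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Three hypothetical issues, each repeated for two metrics.
toy = pd.DataFrame({
    'uuid': ['a', 'a', 'b', 'b', 'c', 'c'],
    'severity': ['CRITICAL', 'CRITICAL', 'CRITICAL', 'CRITICAL',
                 'BLOCKER', 'BLOCKER'],
    'metric': ['ncloc', 'complexity'] * 3,
})

# Counting rows double-counts issues; nunique counts each uuid once.
naive = toy.groupby('severity')['uuid'].count()
distinct = toy.groupby('severity')['uuid'].nunique()
print(naive)
print(distinct)
```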

Descriptive statistics per metric type (complexity, ncloc and duplicated_lines)

In [39]:
df_issues.loc[df_issues['metric'] == 'complexity'].groupby('file_name')['value'].sum().describe()

count     196.000000
mean      185.801020
std       674.949993
min         0.000000
25%         1.000000
50%        14.500000
75%        86.000000
max      5312.000000
Name: value, dtype: float64
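
Summing per file name adds a file's complexity once for every issue row in that file, so a file with several issues contributes a multiple of its actual complexity. That would be consistent with the maximum above (5312) exceeding the per-file maximum of 664 reported for the whole project in section 2, although the notebook does not confirm it. A small sketch on made-up data shows the difference between summing and keeping one value per file:

```python
import pandas as pd

# Hypothetical file A.java with complexity 10 and three issues:
# the join repeats the same value once per issue.
toy = pd.DataFrame({
    'uuid': ['i1', 'i2', 'i3', 'i4'],
    'file_name': ['A.java', 'A.java', 'A.java', 'B.java'],
    'metric': ['complexity'] * 4,
    'value': [10.0, 10.0, 10.0, 4.0],
})

# Sum over all issue rows of a file.
summed = toy.groupby('file_name')['value'].sum()

# One value per file, as done for ncloc in the next cell.
per_file = toy.drop_duplicates('file_name').set_index('file_name')['value']

print(summed)
print(per_file)
```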

In [10]:
df_issues.loc[df_issues['metric'] == 'ncloc'].drop_duplicates('file_name')['value'].describe()

count     196.000000
mean      182.974490
std       291.421335
min         6.000000
25%        27.000000
50%        92.500000
75%       204.750000
max      2693.000000
Name: value, dtype: float64

In [44]:
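# Series.append was removed in pandas 2.0; pd.concat pads the missing files with zeros instead
pd.concat([df_issues.loc[df_issues['metric'] == 'duplicated_lines'].groupby('file_name')['value'].sum(), pd.Series([0]*176)], ignore_index=True).describe()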

count     196.000000
mean       36.806122
std       190.675085
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max      1876.000000
dtype: float64
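
The 176 zeros appended above presumably stand in for files that have no `duplicated_lines` row, bringing the count to the same 196 files as the other metrics. A sketch, on toy data, of getting the same padding without hard-coding that number is to reindex on the full list of file names. The names `toy`, `all_files` and `dup` are illustrative.

```python
import pandas as pd

# Hypothetical measures: only A.java has a duplicated_lines row.
toy = pd.DataFrame({
    'file_name': ['A.java', 'A.java', 'B.java', 'C.java'],
    'metric': ['ncloc', 'duplicated_lines', 'ncloc', 'ncloc'],
    'value': [100.0, 12.0, 40.0, 7.0],
})

all_files = toy['file_name'].unique()

# Keep one duplicated_lines value per file, then fill absent files with 0.
dup = (toy.loc[toy['metric'] == 'duplicated_lines']
          .drop_duplicates('file_name')
          .set_index('file_name')['value']
          .reindex(all_files, fill_value=0.0))

print(dup.describe())
```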

---
# 2. Same analysis, but with all the files in the project for comparison

## Loading data

Query used to get the metrics of every file in the project (the connection string is the same as above):

In [24]:
query_all = """
    select
        p.uuid,
        p.name as file_name,
        m.name as metric,
        l.value as value
    from 
        projects p
        inner join live_measures l on p.uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        p.project_uuid = 'AWrwqThXS_LSSKKohbva'
        and l.metric_id in (3, 18, 47) -- metrics to extract for the file in question
        and p."scope" = 'FIL' and p.qualifier = 'FIL'"""

Loading the results into a DataFrame:

In [25]:
df_all = pd.read_sql(query_all, connection_url)
df_all.head()

Unnamed: 0,uuid,file_name,metric,value
0,AWrwqUVEm0KLequXih0X,DefaultMavenProfilesBuilder.java,complexity,2.0
1,AWrwqUVEm0KLequXih0X,DefaultMavenProfilesBuilder.java,ncloc,54.0
2,AWrwqUVEm0KLequXih0Y,ProfileManager.java,complexity,0.0
3,AWrwqUVEm0KLequXih0Y,ProfileManager.java,ncloc,23.0
4,AWrwqUVEm0KLequXih0Z,DefaultProfileManager.java,complexity,27.0


## Insights

Size of the DataFrame

In [26]:
df_all.drop_duplicates('uuid').shape

(715, 4)

Descriptive statistics per metric type (complexity, ncloc and duplicated_lines)

In [32]:
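# Series.append was removed in pandas 2.0; pd.concat pads the missing files with zeros instead
pd.concat([df_all.loc[df_all['metric'] == 'complexity', 'value'], pd.Series([0]*15)], ignore_index=True).describe()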

count    715.000000
mean      14.033566
std       35.417937
min        0.000000
25%        0.000000
50%        5.000000
75%       15.000000
max      664.000000
dtype: float64

In [14]:
df_all.loc[df_all['metric'] == 'ncloc', 'value'].describe()

count     715.000000
mean       88.925874
std       172.153021
min         1.000000
25%        14.500000
50%        40.000000
75%        95.000000
max      2693.000000
Name: value, dtype: float64

In [33]:
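# Series.append was removed in pandas 2.0; pd.concat pads the missing files with zeros instead
pd.concat([df_all.loc[df_all['metric'] == 'duplicated_lines', 'value'], pd.Series([0]*665)], ignore_index=True).describe()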

count    715.000000
mean       4.153846
std       24.169948
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max      461.000000
dtype: float64
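
To compare the two populations directly, the `describe()` figures can be placed side by side. The sketch below re-enters only the `ncloc` statistics printed in sections 1 and 2; the frame `summary` and its column names are illustrative.

```python
import pandas as pd

# ncloc statistics copied from the describe() outputs in sections 1 and 2.
summary = pd.DataFrame(
    {'files_with_issues': [196, 182.974490, 92.5, 2693.0],
     'all_files': [715, 88.925874, 40.0, 2693.0]},
    index=['count', 'mean', '50%', 'max'],
)

# Ratio between files with BLOCKER/CRITICAL issues and the whole project.
summary['ratio'] = summary['files_with_issues'] / summary['all_files']
print(summary.round(2))
```

In these figures, files with BLOCKER or CRITICAL issues have roughly twice the mean and median `ncloc` of the project as a whole.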

---
# 3. Analysing metrics per issue type (rule)

Now, let's add the column `rule_id` to the query and replace `message` with the rule `name`.

This query returns every issue (technical debt) of severity 'BLOCKER' or 'CRITICAL' together with the rule it violates. That lets us compute the metrics (such as "complexity" or "ncloc") per rule and analyse them.

In [19]:
query_rules = '''
    select
        i.rule_id,
        r.name,
        m.name as metric,
        l.value as value
    from
        issues i 
        inner join rules r on i.rule_id = r.id
        inner join projects p on i.component_uuid = p.uuid
        inner join live_measures l on i.component_uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        i.project_uuid = 'AWrwqThXS_LSSKKohbva'
        -- technical debt of severity BLOCKER or CRITICAL
        and i.severity in ('BLOCKER', 'CRITICAL')
        and l.metric_id in (3, 18, 47) -- metrics to extract for the file in question'''

Loading the data

In [20]:
df_rules = pd.read_sql(query_rules, connection_url)
df_rules.head()

Unnamed: 0,rule_id,name,metric,value
0,5370,Handling files is security-sensitive,ncloc,24.0
1,5370,Handling files is security-sensitive,complexity,3.0
2,5574,Constant names should comply with a naming con...,ncloc,24.0
3,5574,Constant names should comply with a naming con...,complexity,3.0
4,5205,"Fields in a ""Serializable"" class should either...",ncloc,52.0


## Insights

General information

In [21]:
df_rules['name'].unique()

array(['Handling files is security-sensitive',
       'Constant names should comply with a naming convention',
       'Fields in a "Serializable" class should either be transient or serializable',
       'Constants should not be defined in interfaces',
       '"clone" should not be overridden',
       'Using regular expressions is security-sensitive',
       'Cognitive Complexity of methods should not be too high',
       'String literals should not be duplicated',
       'Resources should be closed',
       'Configuring loggers is security-sensitive',
       'Methods should not be empty',
       'Dynamically executing code is security-sensitive',
       'Reading the Standard Input is security-sensitive',
       'Using command line arguments is security-sensitive',
       'Credentials should not be hard-coded',
       'Generic wildcard types should not be used in return parameters',
       'Changing or bypassing accessibility is security-sensitive',
       'Class names should not shado

In [22]:
df_rules['name'].unique().shape

(25,)

## Grouping by `rule_id` and calculating the metric's statistics

In [23]:
df_rules_grouped = df_rules.groupby(['rule_id', 'name', 'metric'])

Row counts per rule and metric (top 10)

In [24]:
df_rules_grouped.count().sort_values('value', ascending=False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,value
rule_id,name,metric,Unnamed: 3_level_1
5510,Cognitive Complexity of methods should not be too high,ncloc,110
5510,Cognitive Complexity of methods should not be too high,complexity,110
5413,Methods should not be empty,ncloc,93
5413,Methods should not be empty,complexity,93
5370,Handling files is security-sensitive,ncloc,79
5370,Handling files is security-sensitive,complexity,79
5098,String literals should not be duplicated,complexity,72
5098,String literals should not be duplicated,ncloc,72
5290,Constants should not be defined in interfaces,complexity,46
5290,Constants should not be defined in interfaces,ncloc,46


Descriptive statistics per rule and metric

In [29]:
df_agg = df_rules_grouped.agg(['describe'])
df_agg.columns = df_agg.columns.droplevel().droplevel()

df_agg.sort_values(['count', 'rule_id', 'metric'], ascending=False).head(16)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
rule_id,name,metric,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
5510,Cognitive Complexity of methods should not be too high,ncloc,110.0,520.745455,379.616041,50.0,205.25,438.5,724.0,1404.0
5510,Cognitive Complexity of methods should not be too high,complexity,110.0,90.309091,72.819463,8.0,34.0,63.0,125.0,314.0
5413,Methods should not be empty,ncloc,93.0,211.731183,205.52193,44.0,106.0,211.0,211.0,1404.0
5413,Methods should not be empty,complexity,93.0,52.956989,41.182762,10.0,29.0,62.0,62.0,314.0
5370,Handling files is security-sensitive,ncloc,79.0,509.822785,455.047187,24.0,166.0,347.0,762.0,1404.0
5370,Handling files is security-sensitive,complexity,79.0,84.974684,81.440237,0.0,26.0,58.0,123.5,314.0
5098,String literals should not be duplicated,ncloc,72.0,723.291667,781.323659,53.0,186.5,482.0,946.0,2693.0
5098,String literals should not be duplicated,complexity,72.0,148.277778,195.60518,3.0,29.75,61.5,190.0,664.0
5290,Constants should not be defined in interfaces,ncloc,46.0,36.108696,83.96672,6.0,10.0,14.0,27.0,568.0
5290,Constants should not be defined in interfaces,complexity,46.0,3.282609,20.658776,0.0,0.0,0.0,0.0,140.0
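
The same table can be obtained without the two `droplevel()` calls by selecting the `value` column before calling `describe()`. A minimal sketch on a hand-made frame; the rule ids are taken from the output above, but the rows themselves are made up.

```python
import pandas as pd

# Hypothetical measurements for two rules.
toy = pd.DataFrame({
    'rule_id': [5510, 5510, 5510, 5413, 5413],
    'metric': ['ncloc'] * 5,
    'value': [50.0, 438.0, 1404.0, 44.0, 211.0],
})

# describe() on a grouped Series yields flat column names directly.
stats = toy.groupby(['rule_id', 'metric'])['value'].describe()
print(stats.sort_values('count', ascending=False))
```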


# Correlation between metric and rule

In [108]:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

## Correlation between non-comment lines of code (ncloc) and "Cognitive Complexity of methods should not be too high"

In [171]:
query_not_ok = """
-- files with technical debt of type rule_id, with the metric metric_id
select 
    value,
    1 as present
from live_measures 
where component_uuid in (select distinct component_uuid from issues where rule_id = 5510 and project_uuid = 'AWrwqThXS_LSSKKohbva')
and metric_id = 3"""

query_ok = """
-- files without technical debt of type rule_id, with the metric metric_id
select 
    value,
    0 as present
from live_measures 
where component_uuid not in (
  select distinct component_uuid from issues where rule_id = 5510 and project_uuid = 'AWrwqThXS_LSSKKohbva')
and component_uuid in (
select uuid from projects where project_uuid = 'AWrwqThXS_LSSKKohbva' and "scope" = 'FIL' and qualifier = 'FIL')
and metric_id = 3"""

In [172]:
df_not_ok = pd.read_sql(query_not_ok, connection_url)
df_ok = pd.read_sql(query_ok, connection_url)
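
The query pair used here differs between experiments only in the `rule_id` and `metric_id` values, so it can be generated rather than copied. The helper below is a hypothetical sketch, not part of the original notebook: `build_rule_queries` and its parameter names are assumptions, and the ids are cast to `int` before being interpolated into the SQL text.

```python
def build_rule_queries(rule_id, metric_id, project_uuid='AWrwqThXS_LSSKKohbva'):
    """Return (query_not_ok, query_ok) for one rule/metric pair.

    Hypothetical helper: project_uuid is treated as a trusted constant.
    """
    # int() raises ValueError for non-numeric input before any SQL is built.
    rule_id, metric_id = int(rule_id), int(metric_id)

    # Files that have at least one issue for the given rule.
    flagged = (
        "select distinct component_uuid from issues "
        f"where rule_id = {rule_id} and project_uuid = '{project_uuid}'"
    )
    query_not_ok = (
        "select value, 1 as present from live_measures "
        f"where component_uuid in ({flagged}) and metric_id = {metric_id}"
    )
    query_ok = (
        "select value, 0 as present from live_measures "
        f"where component_uuid not in ({flagged}) "
        "and component_uuid in (select uuid from projects "
        f"where project_uuid = '{project_uuid}' "
        "and \"scope\" = 'FIL' and qualifier = 'FIL') "
        f"and metric_id = {metric_id}"
    )
    return query_not_ok, query_ok


query_not_ok, query_ok = build_rule_queries(rule_id=5510, metric_id=3)
print(query_not_ok)
```

The returned strings can be passed to `pd.read_sql` exactly like the hand-written queries.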

In [173]:
df_both = pd.concat([df_not_ok, df_ok])

In [174]:
clf = SVC(gamma='scale')
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.91213389, 0.91596639, 0.91176471])

In [175]:
clf = KNeighborsClassifier()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.93723849, 0.90336134, 0.90756303])

In [176]:
clf = GaussianNB()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.92887029, 0.93277311, 0.92857143])
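
Accuracy alone is hard to interpret here because the two classes are imbalanced: a model that always predicts `present = 0` already scores the share of files without the issue. A sketch of that baseline uses scikit-learn's `DummyClassifier` on synthetic labels. The 90/9 split is made up for illustration and is not taken from the database.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data: 90 files without the issue, 9 with it.
X = np.arange(99, dtype=float).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 9)

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X, y, cv=3)
print(baseline_scores)
```

Running the same baseline on `df_both` would give the reference value that the SVC, KNN and naive Bayes scores need to beat.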

## Correlation between complexity and "Cognitive Complexity of methods should not be too high"

In [177]:
query_not_ok = """
-- files with technical debt of type rule_id, with the metric metric_id
select 
    value,
    1 as present
from live_measures 
where component_uuid in (
  select distinct component_uuid from issues where rule_id = 5510 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and metric_id = 18"""

query_ok = """
-- files without technical debt of type rule_id, with the metric metric_id
select 
    value,
    0 as present
from live_measures 
where component_uuid not in (
  select distinct component_uuid from issues where rule_id = 5510 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and component_uuid in (
select uuid from projects where project_uuid = 'AWrwqThXS_LSSKKohbva' and "scope" = 'FIL' and qualifier = 'FIL')
and metric_id = 18"""

In [178]:
df_not_ok = pd.read_sql(query_not_ok, connection_url)
df_ok = pd.read_sql(query_ok, connection_url)

In [179]:
df_both = pd.concat([df_not_ok, df_ok])

In [180]:
clf = SVC(gamma='scale')
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.94017094, 0.91845494, 0.93133047])

In [181]:
clf = KNeighborsClassifier()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.91880342, 0.90987124, 0.90987124])

In [182]:
clf = GaussianNB()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.91880342, 0.93133047, 0.93562232])
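
Since these sections are framed as correlations, a direct correlation measure could complement the classifier scores. A minimal sketch uses `scipy.stats.pointbiserialr`, which correlates a binary variable with a continuous one. The data is synthetic and the variable names are illustrative, so the printed numbers say nothing about the Maven project.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)

# Synthetic metric values: files with the issue are drawn from a higher range.
present = np.array([0] * 200 + [1] * 20)
value = np.concatenate([rng.uniform(0, 30, 200), rng.uniform(40, 300, 20)])

# r is the point-biserial correlation, p_value its two-sided p-value.
r, p_value = ss.pointbiserialr(present, value)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

On the real data the call would be `ss.pointbiserialr(df_both.present, df_both.value)`.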

## Correlation between non-comment lines of code (ncloc) and "Methods should not be empty"

In [183]:
query_not_ok = """
-- files with technical debt of type rule_id, with the metric metric_id
select 
    value,
    1 as present
from live_measures 
where component_uuid in (
  select distinct component_uuid from issues where rule_id = 5413 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and metric_id = 3"""

query_ok = """
-- files without technical debt of type rule_id, with the metric metric_id
select 
    value,
    0 as present
from live_measures 
where component_uuid not in (
  select distinct component_uuid from issues where rule_id = 5413 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and component_uuid in (
select uuid from projects where project_uuid = 'AWrwqThXS_LSSKKohbva' and "scope" = 'FIL' and qualifier = 'FIL')
and metric_id = 3"""

In [184]:
df_not_ok = pd.read_sql(query_not_ok, connection_url)
df_ok = pd.read_sql(query_ok, connection_url)

In [185]:
df_both = pd.concat([df_not_ok, df_ok])

In [186]:
clf = SVC(gamma='scale')
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.9748954 , 0.97478992, 0.97478992])

In [187]:
clf = KNeighborsClassifier()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.9748954 , 0.97478992, 0.97478992])

In [188]:
clf = GaussianNB()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.9707113 , 0.96638655, 0.94117647])
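
The SVC and KNN scores above are identical in every fold, which is what one would see if both models predicted a single class for all files; accuracy alone cannot confirm that. A sketch, on synthetic data, of inspecting out-of-fold predictions per class with a confusion matrix follows; `X`, `y` and the value ranges are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic metric values: 90 files without the issue, 9 with it.
X = np.concatenate([np.linspace(1, 50, 90),
                    np.linspace(200, 400, 9)]).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 9)

# Each file is predicted by a model that did not see it during training.
pred = cross_val_predict(GaussianNB(), X, y, cv=3)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y, pred)
print(cm)
```

A bottom-right cell of zero on the real data would mean the classifier never detects a file with the issue, regardless of how high the accuracy is.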

## Correlation between complexity and "Methods should not be empty"

In [189]:
query_not_ok = """
-- files with technical debt of type rule_id, with the metric metric_id
select 
    value,
    1 as present
from live_measures 
where component_uuid in (
  select distinct component_uuid from issues where rule_id = 5413 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and metric_id = 18"""

query_ok = """
-- files without technical debt of type rule_id, with the metric metric_id
select 
    value,
    0 as present
from live_measures 
where component_uuid not in (
  select distinct component_uuid from issues where rule_id = 5413 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and component_uuid in (
select uuid from projects where project_uuid = 'AWrwqThXS_LSSKKohbva' and "scope" = 'FIL' and qualifier = 'FIL')
and metric_id = 18"""

In [190]:
df_not_ok = pd.read_sql(query_not_ok, connection_url)
df_ok = pd.read_sql(query_ok, connection_url)

In [191]:
df_both = pd.concat([df_not_ok, df_ok])

In [192]:
clf = SVC(gamma='scale')
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.97435897, 0.97424893, 0.97424893])

In [193]:
clf = KNeighborsClassifier()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.97435897, 0.97424893, 0.97424893])

In [194]:
clf = GaussianNB()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.97863248, 0.97424893, 0.93991416])

## Correlation between non-comment lines of code (ncloc) and "Handling files is security-sensitive"

In [195]:
query_not_ok = """
-- files with technical debt of type rule_id, with the metric metric_id
select 
    value,
    1 as present
from live_measures 
where component_uuid in (
  select distinct component_uuid from issues where rule_id = 5370 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and metric_id = 3"""

query_ok = """
-- files without technical debt of type rule_id, with the metric metric_id
select 
    value,
    0 as present
from live_measures 
where component_uuid not in (
  select distinct component_uuid from issues where rule_id = 5370 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and component_uuid in (
select uuid from projects where project_uuid = 'AWrwqThXS_LSSKKohbva' and "scope" = 'FIL' and qualifier = 'FIL')
and metric_id = 3"""

In [196]:
df_not_ok = pd.read_sql(query_not_ok, connection_url)
df_ok = pd.read_sql(query_ok, connection_url)

In [197]:
df_both = pd.concat([df_not_ok, df_ok])

In [198]:
clf = SVC(gamma='scale')
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.94560669, 0.94117647, 0.93697479])

In [199]:
clf = KNeighborsClassifier()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.9539749 , 0.94537815, 0.93697479])

In [200]:
clf = GaussianNB()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.9539749 , 0.94957983, 0.92857143])

## Correlation between complexity and "Handling files is security-sensitive"

In [201]:
query_not_ok = """
-- files with technical debt of type rule_id, with the metric metric_id
select 
    value,
    1 as present
from live_measures 
where component_uuid in (
  select distinct component_uuid from issues where rule_id = 5370 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and metric_id = 18"""

query_ok = """
-- files without technical debt of type rule_id, with the metric metric_id
select 
    value,
    0 as present
from live_measures 
where component_uuid not in (
  select distinct component_uuid from issues where rule_id = 5370 and project_uuid = 'AWrwqThXS_LSSKKohbva'
)
and component_uuid in (
select uuid from projects where project_uuid = 'AWrwqThXS_LSSKKohbva' and "scope" = 'FIL' and qualifier = 'FIL')
and metric_id = 18"""

In [202]:
df_not_ok = pd.read_sql(query_not_ok, connection_url)
df_ok = pd.read_sql(query_ok, connection_url)

In [203]:
df_both = pd.concat([df_not_ok, df_ok])

In [204]:
clf = SVC(gamma='scale')
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.94444444, 0.94420601, 0.94849785])

In [205]:
clf = KNeighborsClassifier()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.94444444, 0.94849785, 0.94420601])

In [206]:
clf = GaussianNB()
scores = cross_val_score(clf, df_both.value.values.reshape(-1,1), df_both.present, cv=3)
scores

array([0.94871795, 0.94849785, 0.91845494])