# 1.BLOCKER and CRITICAL Maven issues analysis

## Loading data

Query used to fetch all BLOCKER and CRITICAL issues, plus the connection string for the (Postgres) database:

In [2]:
query_issues = '''
    select
        i.kee as uuid,
        i.severity,
        i.message as message,
        i.line as line,
        p.name as file_name,
        m.name as metric,
        l.value as value
    from
        issues i 
        inner join projects p on i.component_uuid = p.uuid
        inner join live_measures l on i.component_uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        -- technical debt items with severity BLOCKER or CRITICAL
        i.severity in ('BLOCKER', 'CRITICAL')
        and l.metric_id in (3, 18) -- metrics to extract for the file in question'''

connection_url = 'postgresql://sonar:sonar@localhost/sonar'

Importing analysis libraries:

In [72]:
import pandas as pd
import numpy as np
import scipy.stats as ss

Loading the results into a DataFrame:

In [4]:
df_issues = pd.read_sql(query_issues, connection_url)
df_issues.head()

,uuid,severity,message,line,file_name,metric,value
0,AWrwqUzAm0KLequXiiF4,CRITICAL,Refactor this method to reduce its Cognitive C...,51,SystemPropertyProfileActivator.java,ncloc,78.0
1,AWrwqUzAm0KLequXiiF4,CRITICAL,Refactor this method to reduce its Cognitive C...,51,SystemPropertyProfileActivator.java,complexity,11.0
2,AWrwqUzJm0KLequXiiF5,CRITICAL,Move constants to a class or enum.,28,ProfileActivator.java,ncloc,10.0
3,AWrwqUzJm0KLequXiiF5,CRITICAL,Move constants to a class or enum.,28,ProfileActivator.java,complexity,0.0
4,AWrwqUzYm0KLequXiiF-,CRITICAL,Move constants to a class or enum.,31,MavenProfilesBuilder.java,ncloc,11.0
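
Each issue shows up twice in this frame, once per metric, because the join with `live_measures` returns one row per (issue, metric) pair. A minimal sketch of how the long format could be reshaped into one row per issue, using a small hand-made frame standing in for `df_issues` (the uuids, file names and the name `wide` are made up for illustration and are not part of the original analysis):

```python
import pandas as pd

# Toy stand-in for df_issues: two issues, each repeated once per metric.
toy = pd.DataFrame({
    "uuid": ["a1", "a1", "b2", "b2"],
    "severity": ["CRITICAL", "CRITICAL", "BLOCKER", "BLOCKER"],
    "file_name": ["Foo.java", "Foo.java", "Bar.java", "Bar.java"],
    "metric": ["ncloc", "complexity", "ncloc", "complexity"],
    "value": [78.0, 11.0, 10.0, 0.0],
})

# One row per issue, one column per metric.
wide = (
    toy.pivot_table(index=["uuid", "severity", "file_name"],
                    columns="metric", values="value")
       .reset_index()
)
wide.columns.name = None  # drop the leftover "metric" label on the columns
print(wide)
```

With both metrics on the same row, per-issue comparisons (for example complexity against ncloc) no longer need a filter on `metric`.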


## Insights

Size of the DataFrame

In [5]:
df_issues.shape

(1066, 7)

Issues count per severity

In [6]:
df_issues.drop_duplicates('uuid').groupby('severity').count().uuid

severity
BLOCKER      15
CRITICAL    518
Name: uuid, dtype: int64
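
These counts are consistent with the frame size: 15 + 518 = 533 distinct issues, and each issue appears once per metric, so 533 × 2 = 1066 rows. An equivalent way to count distinct issues is `nunique`, sketched here on toy data (the uuids are hypothetical):

```python
import pandas as pd

# Toy stand-in for df_issues: three issues, two rows each (one per metric).
toy = pd.DataFrame({
    "uuid": ["a1", "a1", "b2", "b2", "c3", "c3"],
    "severity": ["CRITICAL", "CRITICAL", "BLOCKER", "BLOCKER",
                 "CRITICAL", "CRITICAL"],
})

# Distinct issues per severity, without building a de-duplicated frame first.
per_severity = toy.groupby("severity")["uuid"].nunique()
print(per_severity)

# Same arithmetic as in the real data: distinct issues x 2 metrics = rows.
rows_expected = per_severity.sum() * 2
```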

Descriptive statistics per metric type (complexity and ncloc)

In [7]:
df_issues.loc[df_issues['metric'] == 'complexity', 'value'].describe()

count    533.000000
mean      68.324578
std       99.771608
min        0.000000
25%       10.000000
50%       37.000000
75%       69.000000
max      664.000000
Name: value, dtype: float64

In [8]:
df_issues.loc[df_issues['metric'] == 'ncloc', 'value'].describe()

count     533.000000
mean      369.288931
std       466.194592
min         6.000000
25%        62.000000
50%       211.000000
75%       443.000000
max      2693.000000
Name: value, dtype: float64

The same analysis, but without the _test_ files
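
A minimal sketch of how such a filter could look, assuming test files can be recognised by a `Test`, `Tests` or `TestCase` suffix right before the `.java` extension (a common Maven naming convention; the source does not say how test files were identified). The toy frame and the names `is_test` and `no_tests` are illustrative only:

```python
import pandas as pd

# Toy stand-in for df_issues with one hypothetical test file.
toy = pd.DataFrame({
    "file_name": ["ProfileActivator.java", "ProfileActivatorTest.java",
                  "MavenProfilesBuilder.java"],
    "metric": ["ncloc", "ncloc", "ncloc"],
    "value": [10.0, 120.0, 11.0],
})

# Assumption: test classes end in Test.java, Tests.java or TestCase.java.
is_test = toy["file_name"].str.contains(r"Tests?(?:Case)?\.java$", regex=True)
no_tests = toy.loc[~is_test]

print(no_tests["value"].describe())
```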

---
# 2.Same analysis, but with all the files in the project for comparison

## Loading data

Query used to get the metrics of every file in the project (the connection string from section 1 is reused):

In [9]:
query_all = """
    select
        p.name as file_name,
        m.name as metric,
        l.value as value
    from 
        projects p
        inner join live_measures l on p.uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        l.metric_id in (3, 18) -- metrics to extract for the file in question
        and p."scope" = 'FIL' and p.qualifier = 'FIL'"""

Loading the results into a DataFrame:

In [10]:
df_all = pd.read_sql(query_all, connection_url)
df_all.head()

,file_name,metric,value
0,DefaultMavenProfilesBuilder.java,complexity,2.0
1,DefaultMavenProfilesBuilder.java,ncloc,54.0
2,ProfileManager.java,complexity,0.0
3,ProfileManager.java,ncloc,23.0
4,DefaultProfileManager.java,complexity,27.0


## Insights

Size of the DataFrame

In [11]:
df_all.shape

(1415, 3)

Descriptive statistics per metric type (complexity and ncloc)

In [12]:
df_all.loc[df_all['metric'] == 'complexity', 'value'].describe()

count    700.000000
mean      14.334286
std       35.735593
min        0.000000
25%        0.000000
50%        5.000000
75%       15.000000
max      664.000000
Name: value, dtype: float64

In [13]:
df_all.loc[df_all['metric'] == 'ncloc', 'value'].describe()

count     715.000000
mean       88.925874
std       172.153021
min         1.000000
25%        14.500000
50%        40.000000
75%        95.000000
max      2693.000000
Name: value, dtype: float64
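
Putting the two sets of statistics side by side makes the gap concrete: the median file behind a BLOCKER or CRITICAL issue has roughly seven times the complexity and five times the ncloc of the median project file. Note that the issue-based figures count a file once per issue it contains, so files with many issues weigh more. A quick sketch of the ratios, with the numbers copied from the `describe()` outputs in sections 1 and 2:

```python
# Median and mean values copied from the describe() outputs in sections 1 and 2.
issues = {"complexity": {"median": 37.0, "mean": 68.324578},
          "ncloc": {"median": 211.0, "mean": 369.288931}}
all_files = {"complexity": {"median": 5.0, "mean": 14.334286},
             "ncloc": {"median": 40.0, "mean": 88.925874}}

# Ratio of issue-file statistic to all-file statistic, per metric.
ratios = {
    metric: {stat: issues[metric][stat] / all_files[metric][stat]
             for stat in ("median", "mean")}
    for metric in issues
}

for metric, stats in ratios.items():
    print(f"{metric}: median x{stats['median']:.2f}, mean x{stats['mean']:.2f}")
```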

---
# 3.Analysing metrics per issue type (rule)

Now, let's add the column `rule_id` and replace `message` with the rule `name` in the query.

This query fetches every issue (technical debt item) of severity 'BLOCKER' or 'CRITICAL' together with the rule it breaks. This way, we'll be able to get the metrics (like "complexity" or "ncloc") per rule and analyse them.

In [14]:
query_rules = '''
    select
        i.rule_id,
        r.name,
        m.name as metric,
        l.value as value
    from
        issues i 
        inner join rules r on i.rule_id = r.id
        inner join projects p on i.component_uuid = p.uuid
        inner join live_measures l on i.component_uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        -- technical debt items with severity BLOCKER or CRITICAL
        i.severity in ('BLOCKER', 'CRITICAL')
        and l.metric_id in (3, 18) -- metrics to extract for the file in question'''

Loading the data

In [37]:
df_rules = pd.read_sql(query_rules, connection_url)
df_rules.head()

,rule_id,name,metric,value
0,5510,Cognitive Complexity of methods should not be ...,ncloc,78.0
1,5510,Cognitive Complexity of methods should not be ...,complexity,11.0
2,5290,Constants should not be defined in interfaces,ncloc,10.0
3,5290,Constants should not be defined in interfaces,complexity,0.0
4,5290,Constants should not be defined in interfaces,ncloc,11.0


## Insights

General information

In [38]:
df_rules['name'].unique()

array(['Cognitive Complexity of methods should not be too high',
       'Constants should not be defined in interfaces',
       'String literals should not be duplicated',
       '"ThreadGroup" should not be used', 'Methods should not be empty',
       'Constant names should comply with a naming convention',
       'Handling files is security-sensitive',
       'Fields in a "Serializable" class should either be transient or serializable',
       'Class names should not shadow interfaces or superclasses',
       'Resources should be closed', 'Try-with-resources should be used',
       'Dynamically executing code is security-sensitive',
       'Changing or bypassing accessibility is security-sensitive',
       'Short-circuit logic should be used in boolean contexts',
       '"Random" objects should be reused',
       'Using pseudorandom number generators (PRNGs) is security-sensitive',
       '"clone" should not be overridden',
       'Expanding archive files is security-sensitive',
    

In [39]:
df_rules['name'].unique().shape

(25,)

## Grouping by `rule_id` and calculating the metric's statistics

In [54]:
df_rules_grouped = df_rules.groupby(['rule_id', 'name', 'metric'])

Row counts per rule and metric (top 10):

In [55]:
df_rules_grouped.count().sort_values('value', ascending=False).head(10)

rule_id,name,metric,value
5510,Cognitive Complexity of methods should not be too high,complexity,110
5510,Cognitive Complexity of methods should not be too high,ncloc,110
5413,Methods should not be empty,complexity,93
5413,Methods should not be empty,ncloc,93
5370,Handling files is security-sensitive,complexity,79
5370,Handling files is security-sensitive,ncloc,79
5098,String literals should not be duplicated,ncloc,72
5098,String literals should not be duplicated,complexity,72
5290,Constants should not be defined in interfaces,ncloc,46
5290,Constants should not be defined in interfaces,complexity,46


Descriptive statistics per rule and metric:

In [56]:
# describe() on the grouped 'value' column yields count, mean, std and quartiles directly
df_agg = df_rules_grouped['value'].describe()

df_agg.sort_values(['count', 'rule_id', 'metric'], ascending=False)

rule_id,name,metric,count,mean,std,min,25%,50%,75%,max
5510,Cognitive Complexity of methods should not be too high,ncloc,110.0,520.745455,379.616041,50.0,205.25,438.5,724.0,1404.0
5510,Cognitive Complexity of methods should not be too high,complexity,110.0,90.309091,72.819463,8.0,34.0,63.0,125.0,314.0
5413,Methods should not be empty,ncloc,93.0,211.731183,205.52193,44.0,106.0,211.0,211.0,1404.0
5413,Methods should not be empty,complexity,93.0,52.956989,41.182762,10.0,29.0,62.0,62.0,314.0
5370,Handling files is security-sensitive,ncloc,79.0,509.822785,455.047187,24.0,166.0,347.0,762.0,1404.0
5370,Handling files is security-sensitive,complexity,79.0,84.974684,81.440237,0.0,26.0,58.0,123.5,314.0
5098,String literals should not be duplicated,ncloc,72.0,723.291667,781.323659,53.0,186.5,482.0,946.0,2693.0
5098,String literals should not be duplicated,complexity,72.0,148.277778,195.60518,3.0,29.75,61.5,190.0,664.0
5290,Constants should not be defined in interfaces,ncloc,46.0,36.108696,83.96672,6.0,10.0,14.0,27.0,568.0
5290,Constants should not be defined in interfaces,complexity,46.0,3.282609,20.658776,0.0,0.0,0.0,0.0,140.0


Now let's compute some **correlation**:

First we need a measure of association for categorical variables; we're going to use Cramér's V, which ranges from 0 (no association) to 1 (perfect association).

In [74]:
def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Wicher Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    Expects a table of counts, e.g. the output of pd.crosstab.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]  # chi-squared statistic
    n = confusion_matrix.sum().sum()                 # total number of observations
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2 and bias-corrected table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
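
Written out, the function above computes the following, where $\chi^2$ is the chi-squared statistic of the $r \times k$ table of counts and $n$ is the total number of observations:

$$
\tilde{\varphi}^2 = \max\!\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
$$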

Then, let's build the contingency table (called a confusion matrix here) of rule name against metric value, one for each metric

In [75]:
df_complexity = df_rules.loc[df_rules['metric'] == 'complexity', ['name', 'value']]
df_ncloc = df_rules.loc[df_rules['metric'] == 'ncloc', ['name', 'value']]
conf_matrix_complexity = pd.crosstab(df_complexity['name'], df_complexity['value'])
conf_matrix_ncloc = pd.crosstab(df_ncloc['name'], df_ncloc['value'])
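
To make the shape of these tables concrete, here is a sketch of `pd.crosstab` on a toy frame (the rule names and values are made up): rows are rule names, columns are the distinct metric values, and each cell counts how often the pair occurs.

```python
import pandas as pd

# Toy stand-in for df_complexity / df_ncloc.
toy = pd.DataFrame({
    "name": ["Rule A", "Rule A", "Rule B", "Rule B", "Rule B"],
    "value": [10.0, 10.0, 10.0, 25.0, 25.0],
})

# Rows: rule names; columns: distinct metric values; cells: co-occurrence counts.
table = pd.crosstab(toy["name"], toy["value"])
print(table)
```

Since every distinct metric value becomes its own column, the real tables are wide and mostly filled with zeros.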

Finally, the correlation:

In [77]:
# complexity
cramers_corrected_stat(conf_matrix_complexity)

0.4201825204898199

In [79]:
# ncloc
cramers_corrected_stat(conf_matrix_ncloc)

0.49842911705634374
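
To get a feel for where 0.42 and 0.50 sit on the scale, the statistic can be checked on tables where the answer is known. The sketch below is a dependency-free re-implementation of the same formula. It computes chi-squared by hand and applies no continuity correction, so it is not a drop-in replacement for `cramers_corrected_stat`; the helper name `cramers_v_plain` and both toy tables are made up for illustration.

```python
import math

def cramers_v_plain(table):
    """Bias-corrected Cramér's V for a list-of-lists table of counts."""
    r, k = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(k)]
    n = sum(row_tot)
    # Pearson chi-squared: sum over cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Same bias correction as in the function above.
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]   # each row maps to one column
independent = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]  # rows say nothing about columns

v_perfect = cramers_v_plain(perfect)
v_independent = cramers_v_plain(independent)
print(v_perfect, v_independent)
```

The perfectly associated table comes out at (approximately) 1 and the independent one at 0, which brackets the two values obtained for complexity and ncloc.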