# 1. BLOCKER and CRITICAL Ant issues analysis

## Loading data

Query that fetches every BLOCKER and CRITICAL issue together with the metrics of its file, followed by the connection string for the (Postgres) database:

In [1]:
query_issues = '''
    select
        i.kee as uuid,
        i.severity,
        i.message as message,
        i.line as line,
        p.name as file_name,
        m.name as metric,
        l.value as value
    from
        issues i 
        inner join projects p on i.component_uuid = p.uuid
        inner join live_measures l on i.component_uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        i.project_uuid = 'AWwKYKvNNVnBRBMSHei7'
        -- technical debt items with severity BLOCKER or CRITICAL
        and i.severity in ('BLOCKER', 'CRITICAL')
        and l.metric_id in (3, 18, 47) -- metrics to extract for the file in question'''

connection_url = 'postgresql://sonar:sonar@localhost/sonar'

Importing analysis libraries:

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as ss

Loading the results into a DataFrame:

In [3]:
df_issues = pd.read_sql(query_issues, connection_url)
df_issues.head()

Unnamed: 0,uuid,severity,message,line,file_name,metric,value
0,AWwKYN7qIFKoM8TmMFvQ,CRITICAL,Refactor this method to reduce its Cognitive C...,789,ZipFile.java,ncloc,577.0
1,AWwKYN7qIFKoM8TmMFvQ,CRITICAL,Refactor this method to reduce its Cognitive C...,789,ZipFile.java,complexity,96.0
2,AWwKYN7qIFKoM8TmMFvO,CRITICAL,Do not override the Object.finalize() method.,409,ZipFile.java,ncloc,577.0
3,AWwKYN7qIFKoM8TmMFvO,CRITICAL,Do not override the Object.finalize() method.,409,ZipFile.java,complexity,96.0
4,AWwKYN7qIFKoM8TmMFvM,CRITICAL,Make sure this file handling is safe here.,178,ZipFile.java,ncloc,577.0
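
Each issue appears once per metric in this result because the query joins `issues` to `live_measures` on the component. A minimal sketch of that fan-out, using an in-memory SQLite database with a deliberately simplified, hypothetical two-table schema (not the real SonarQube schema):

```python
import sqlite3

# Toy stand-ins for the SonarQube tables; the column sets are simplified assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table issues (kee text, severity text, component_uuid text);
    create table live_measures (component_uuid text, metric text, value real);
    insert into issues values ('issue-1', 'CRITICAL', 'file-a'),
                              ('issue-2', 'BLOCKER',  'file-a');
    insert into live_measures values ('file-a', 'ncloc', 577),
                                     ('file-a', 'complexity', 96);
""")

rows = conn.execute("""
    select i.kee, i.severity, l.metric, l.value
    from issues i
    inner join live_measures l on i.component_uuid = l.component_uuid
    order by i.kee, l.metric
""").fetchall()

# Two issues times two metrics on the same file gives four rows.
for row in rows:
    print(row)
```

This is why later cells de-duplicate on `uuid` before counting issues.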


## Insights

Size of the Dataframe

In [4]:
df_issues.shape

(2369, 7)

Issues count per severity

In [5]:
df_issues.drop_duplicates('uuid').groupby('severity').count().uuid

severity
BLOCKER      88
CRITICAL    992
Name: uuid, dtype: int64
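
The `drop_duplicates('uuid')` step matters because every issue is repeated once per metric. A small sketch on made-up data (the `toy` frame is an assumption for illustration) showing the same count two ways:

```python
import pandas as pd

# Toy frame mimicking df_issues: each issue repeats once per metric.
toy = pd.DataFrame({
    "uuid": ["a", "a", "b", "b", "c", "c"],
    "severity": ["CRITICAL", "CRITICAL", "BLOCKER", "BLOCKER", "CRITICAL", "CRITICAL"],
    "metric": ["ncloc", "complexity"] * 3,
})

# Same idea as the cell above: keep one row per issue, then count.
per_issue = toy.drop_duplicates("uuid").groupby("severity")["uuid"].count()

# Equivalent result without the explicit de-duplication step.
distinct = toy.groupby("severity")["uuid"].nunique()

print(per_issue)
print(distinct)
```

Both approaches count distinct issues rather than rows.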

Descriptive statistics per metric type (complexity, ncloc and duplicated_lines)

In [14]:
df_issues.loc[df_issues['metric'] == 'complexity'].drop_duplicates('file_name')['value'].describe()

Unnamed: 0,uuid,severity,message,line,file_name,metric,value
1,AWwKYN7qIFKoM8TmMFvQ,CRITICAL,Refactor this method to reduce its Cognitive C...,789,ZipFile.java,complexity,96.0
3,AWwKYN7qIFKoM8TmMFvO,CRITICAL,Do not override the Object.finalize() method.,409,ZipFile.java,complexity,96.0
5,AWwKYN7qIFKoM8TmMFvM,CRITICAL,Make sure this file handling is safe here.,178,ZipFile.java,complexity,96.0
7,AWwKYN7qIFKoM8TmMFvL,CRITICAL,Make sure this file handling is safe here.,164,ZipFile.java,complexity,96.0
9,AWwKYN7hIFKoM8TmMFvE,CRITICAL,Refactor this method to reduce its Cognitive C...,732,ZipEntry.java,complexity,117.0
11,AWwKYN7hIFKoM8TmMFvA,BLOCKER,"Remove this ""clone"" implementation; use a copy...",178,ZipEntry.java,complexity,117.0
13,AWwKYN7hIFKoM8TmMFu-,BLOCKER,"""name"" is the name of a field in ""ZipEntry"".",79,ZipEntry.java,complexity,117.0
15,AWwKYN7hIFKoM8TmMFu9,BLOCKER,"""size"" is the name of a field in ""ZipEntry"".",72,ZipEntry.java,complexity,117.0
17,AWwKYN7hIFKoM8TmMFu8,BLOCKER,"""method"" is the name of a field in ""ZipEntry"".",64,ZipEntry.java,complexity,117.0
19,AWwKYN7hIFKoM8TmMFu7,CRITICAL,Rename this class.,49,ZipEntry.java,complexity,117.0


In [9]:
df_issues.loc[df_issues['metric'] == 'ncloc', 'value'].describe()

count    1080.000000
mean      441.074074
std       380.976573
min         5.000000
25%       151.000000
50%       333.000000
75%       652.000000
max      1667.000000
Name: value, dtype: float64

In [10]:
df_issues.loc[df_issues['metric'] == 'duplicated_lines', 'value'].describe()

count    209.000000
mean     259.401914
std      347.358232
min       18.000000
25%       32.000000
50%       67.000000
75%      268.000000
max      961.000000
Name: value, dtype: float64
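
In the issue-level frame, a file's metric value is repeated once for every issue on that file (the `ncloc` count of 1080 equals the 88 + 992 issues), so these summaries are weighted by issue count. A sketch of the difference between the per-issue and per-file views, on made-up values and assuming file names are unique within the project (which the query does not guarantee):

```python
import pandas as pd

# Toy frame: A.java has three issues, B.java has one (values are made up).
toy = pd.DataFrame({
    "uuid": ["i1", "i2", "i3", "i4"],
    "file_name": ["A.java", "A.java", "A.java", "B.java"],
    "metric": ["ncloc"] * 4,
    "value": [600.0, 600.0, 600.0, 100.0],
})

# Per-issue view: A.java is counted three times.
per_issue_mean = toy["value"].mean()

# Per-file view: one row per (file, metric) pair.
per_file = toy.drop_duplicates(["file_name", "metric"])
per_file_mean = per_file["value"].mean()

print(per_issue_mean, per_file_mean)
```

Which view is appropriate depends on whether the question is about issues or about files.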

---
# 2. Same analysis, but with all the files in the project for comparison

## Loading data

Query used to get the same metrics for every file in the project (the connection string from section 1 is reused):

In [11]:
query_all = """
    select
        p.name as file_name,
        m.name as metric,
        l.value as value
    from 
        projects p
        inner join live_measures l on p.uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        p.project_uuid = 'AWwKYKvNNVnBRBMSHei7'
        and l.metric_id in (3, 18, 47) -- metrics to extract for the file in question
        and p."scope" = 'FIL' and p.qualifier = 'FIL'"""

Loading the results into a DataFrame:

In [12]:
df_all = pd.read_sql(query_all, connection_url)
df_all.head()

Unnamed: 0,file_name,metric,value
0,MultiRootFileSet.java,complexity,38.0
1,MultiRootFileSet.java,ncloc,155.0
2,Type.java,complexity,10.0
3,Type.java,ncloc,40.0
4,Not.java,complexity,5.0


## Insights

Size of the Dataframe

In [13]:
df_all.shape

(1939, 3)

Descriptive statistics per metric type (complexity, ncloc and duplicated_lines)

In [14]:
df_all.loc[df_all['metric'] == 'complexity', 'value'].describe()

count    923.000000
mean      27.565547
std       43.122798
min        0.000000
25%        4.000000
50%       13.000000
75%       31.000000
max      401.000000
Name: value, dtype: float64

In [15]:
df_all.loc[df_all['metric'] == 'ncloc', 'value'].describe()

count     925.000000
mean      121.480000
std       176.656522
min         1.000000
25%        24.000000
50%        61.000000
75%       134.000000
max      1667.000000
Name: value, dtype: float64

In [16]:
df_all.loc[df_all['metric'] == 'duplicated_lines', 'value'].describe()

count     91.000000
mean      77.197802
std      128.908083
min       11.000000
25%       26.500000
50%       42.000000
75%       75.500000
max      961.000000
Name: value, dtype: float64
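
To read the two populations against each other, the per-metric summaries can be placed side by side. A minimal sketch with made-up numbers; `issues_files` and `all_files` are hypothetical stand-ins for `df_issues` and `df_all`:

```python
import pandas as pd

# Made-up values standing in for df_issues and df_all (long format).
issues_files = pd.DataFrame({
    "metric": ["ncloc", "ncloc", "complexity", "complexity"],
    "value": [300.0, 500.0, 80.0, 120.0],
})
all_files = pd.DataFrame({
    "metric": ["ncloc", "ncloc", "ncloc", "complexity", "complexity", "complexity"],
    "value": [50.0, 300.0, 500.0, 10.0, 80.0, 120.0],
})

# One median per metric for each population, aligned on the metric name.
comparison = pd.concat(
    {
        "files_with_issues": issues_files.groupby("metric")["value"].median(),
        "all_files": all_files.groupby("metric")["value"].median(),
    },
    axis=1,
)
print(comparison)
```

The median is used here only as an example; any statistic from `describe()` could be compared the same way.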

---
# 3. Analysing metrics per issue type (rule)

Now let's add the `rule_id` column and replace `message` with the rule `name` in the query.

The query returns every issue (technical debt item) of severity 'BLOCKER' or 'CRITICAL' along with the rule it violates. This lets us compute the metrics (such as "complexity" or "ncloc") per rule and analyse them.

In [17]:
query_rules = '''
    select
        i.rule_id,
        r.name,
        m.name as metric,
        l.value as value
    from
        issues i 
        inner join rules r on i.rule_id = r.id
        inner join projects p on i.component_uuid = p.uuid
        inner join live_measures l on i.component_uuid = l.component_uuid
        inner join metrics m on l.metric_id = m.id
    where
        i.project_uuid = 'AWwKYKvNNVnBRBMSHei7'
        -- technical debt items with severity BLOCKER or CRITICAL
        and i.severity in ('BLOCKER', 'CRITICAL')
        and l.metric_id in (3, 18)--, 47) -- metrics to extract for the file in question'''

Loading the data

In [18]:
df_rules = pd.read_sql(query_rules, connection_url)
df_rules.head()

Unnamed: 0,rule_id,name,metric,value
0,5510,Cognitive Complexity of methods should not be ...,ncloc,577.0
1,5510,Cognitive Complexity of methods should not be ...,complexity,96.0
2,5245,The Object.finalize() method should not be ove...,ncloc,577.0
3,5245,The Object.finalize() method should not be ove...,complexity,96.0
4,5370,Handling files is security-sensitive,ncloc,577.0


## Insights

General information

In [19]:
df_rules['name'].unique()

array(['Cognitive Complexity of methods should not be too high',
       'The Object.finalize() method should not be overriden',
       'Handling files is security-sensitive',
       '"clone" should not be overridden',
       'Child class fields should not shadow parent class fields',
       'Class names should not shadow interfaces or superclasses',
       'Constants should not be defined in interfaces',
       'Constant names should comply with a naming convention',
       'Methods should not be empty',
       'Using Sockets is security-sensitive',
       'Dynamically executing code is security-sensitive',
       'Expanding archive files is security-sensitive',
       'Using command line arguments is security-sensitive',
       'String literals should not be duplicated',
       'Untrusted XML should be parsed with a local, static DTD',
       '"switch" statements should have "default" clauses',
       'Reading the Standard Input is security-sensitive',
       'Using regular expression

In [20]:
df_rules['name'].unique().shape

(39,)

## Grouping by `rule_id` and calculating the metric's statistics

In [21]:
df_rules_grouped = df_rules.groupby(['rule_id', 'name', 'metric'])

Number of rows per rule and metric (top 10):

In [22]:
df_rules_grouped.count().sort_values('value', ascending=False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,value
rule_id,name,metric,Unnamed: 3_level_1
5510,Cognitive Complexity of methods should not be too high,ncloc,292
5510,Cognitive Complexity of methods should not be too high,complexity,292
5370,Handling files is security-sensitive,ncloc,197
5370,Handling files is security-sensitive,complexity,197
5413,Methods should not be empty,ncloc,128
5413,Methods should not be empty,complexity,128
5098,String literals should not be duplicated,complexity,119
5098,String literals should not be duplicated,ncloc,119
5297,Dynamically executing code is security-sensitive,ncloc,80
5297,Dynamically executing code is security-sensitive,complexity,80


Descriptive statistics per rule and metric:

In [25]:
# describe() on the grouped 'value' column yields count, mean, std, ... as flat columns
df_agg = df_rules_grouped['value'].describe()

df_agg.sort_values(['count', 'rule_id', 'metric'], ascending=False).head(16)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
rule_id,name,metric,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
5510,Cognitive Complexity of methods should not be too high,ncloc,292.0,541.84589,404.334276,53.0,248.25,434.0,710.5,1667.0
5510,Cognitive Complexity of methods should not be too high,complexity,292.0,130.458904,98.710039,9.0,57.75,107.0,177.0,401.0
5370,Handling files is security-sensitive,ncloc,197.0,493.467005,358.039097,15.0,231.0,391.0,721.0,1667.0
5370,Handling files is security-sensitive,complexity,197.0,116.771574,91.529596,4.0,49.0,80.0,179.0,401.0
5413,Methods should not be empty,ncloc,128.0,260.648438,269.51831,5.0,85.0,151.0,289.0,1193.0
5413,Methods should not be empty,complexity,128.0,57.476562,61.94981,1.0,21.0,31.5,62.0,293.0
5098,String literals should not be duplicated,ncloc,119.0,545.529412,395.591722,22.0,249.0,434.0,758.0,1667.0
5098,String literals should not be duplicated,complexity,119.0,129.033613,97.491758,3.0,61.0,99.0,173.5,401.0
5297,Dynamically executing code is security-sensitive,ncloc,80.0,297.3125,283.405135,32.0,90.0,191.0,388.0,1245.0
5297,Dynamically executing code is security-sensitive,complexity,80.0,63.4375,67.151603,2.0,16.5,40.0,80.75,295.0
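
As a reading aid for the table above: `describe()` on a grouped column returns one row per group and one column per statistic. A small sketch on made-up data (the `toy` frame and its rule names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "rule_id": [1, 1, 1, 2, 2],
    "name": ["rule one"] * 3 + ["rule two"] * 2,
    "metric": ["ncloc"] * 5,
    "value": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# One row per (rule_id, name, metric) group, one column per statistic.
stats = toy.groupby(["rule_id", "name", "metric"])["value"].describe()
print(stats[["count", "mean", "min", "max"]])
```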


# REDO

Now let's compute a measure of **correlation** between the rule and each metric:

First we need a measure of association for categorical variables; we'll use Cramér's V.

In [24]:
def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramér's V statistic for categorical-categorical association.
        uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
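
Written out, the function above computes the following, where $\chi^2$ is the chi-square statistic of the $r \times k$ table passed in and $n$ is the total number of observations (the symbols mirror the variable names in the code):

$$
\tilde{\varphi}^2 = \max\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}
$$

$$
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
$$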

Then, let's build the contingency table (called a confusion matrix in the code) of rule name against metric value, one per metric:

In [25]:
df_complexity = df_rules.loc[df_rules['metric'] == 'complexity', ['name', 'value']]
df_ncloc = df_rules.loc[df_rules['metric'] == 'ncloc', ['name', 'value']]
conf_matrix_complexity = pd.crosstab(df_complexity['name'], df_complexity['value'])
conf_matrix_ncloc = pd.crosstab(df_ncloc['name'], df_ncloc['value'])
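
To make the shape of these tables concrete, here is a sketch of `pd.crosstab` on a few made-up rows (the rule names and values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["rule A", "rule A", "rule B", "rule B", "rule B"],
    "value": [96.0, 117.0, 96.0, 96.0, 401.0],
})

# Rows are rule names, columns are the distinct metric values, cells are counts.
table = pd.crosstab(toy["name"], toy["value"])
print(table)
```

Every distinct numeric value becomes its own column, so a table built from the real data can be wide and sparse.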

Finally, the correlation:

In [26]:
# complexity
cramers_corrected_stat(conf_matrix_complexity)

0.2966368467722552

In [27]:
# ncloc
cramers_corrected_stat(conf_matrix_ncloc)

0.35527362654266226
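
To help place these values on the statistic's 0 to 1 scale, here is a sketch that evaluates the same bias-corrected formula on two toy tables. It is an illustration, not the notebook's function: the chi-square statistic is computed by hand with NumPy instead of `ss.chi2_contingency`, which should give the same statistic for tables larger than 2×2, since SciPy applies the Yates continuity correction only when there is one degree of freedom. The name `cramers_v_by_hand` and both tables are assumptions made for this example.

```python
import numpy as np

def cramers_v_by_hand(table):
    """Bias-corrected Cramér's V with chi-square computed explicitly (no Yates correction)."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins, divided by n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    r, k = observed.shape
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))

# Rows and columns are unrelated: every cell matches its expected count.
independent = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
# Each row determines its column exactly.
associated = [[30, 0, 0], [0, 30, 0], [0, 0, 30]]

v_independent = cramers_v_by_hand(independent)
v_associated = cramers_v_by_hand(associated)
print(v_independent, v_associated)
```

The independent table sits at the bottom of the scale and the fully associated one at the top; the two results above fall between those extremes.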