<a href="https://colab.research.google.com/github/armandossrecife/piloto/blob/main/satd_analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My SATD Analyzer

1. Identify Keywords for Self-Admitted Technical Debt (SATD):
Start by identifying keywords or phrases that commonly indicate self-admitted technical debt. These keywords can be derived from literature or based on your understanding of common terms used in code comments, commit messages, and issue descriptions. For example, common keywords may include "TODO," "FIXME," "refactor," "technical debt," "hack," "workaround," etc.

2. Create a Dictionary of SATD Keywords:
Build a dictionary or a set of SATD keywords that will serve as the basis for content queries in commit messages, modified file comments, and issue descriptions. This dictionary should include the keywords identified in Step 1.

3. Using PyDriller for Commit Analysis:
PyDriller is a Python library for analyzing Git repositories; install it with pip.
Write Python scripts that use PyDriller to iterate through the commits and identify those whose messages contain SATD keywords. Store the matching commits in a Set.

4. Analyze Modified Files in Commits:
For each commit, extract the list of modified files. Open and analyze these files to check for SATD keywords in the code comments within the modified lines. Store the commits with modified files containing SATD in a separate Set.

5. Retrieve Issues from the Issue Tracker:
Use Jira's REST API to fetch issues from the issue tracker, either directly (for example, with Python's requests library) or through a client library such as the `jira` package used in this notebook. Query issues by project (e.g., CASSANDRA) and extract the content of the Summary, Description, and Comments fields for each issue.

6. Analyze Issue Descriptions and Comments:
Analyze the content of the Summary, Description, and Comments fields for the presence of SATD keywords. Store the issues that contain these keywords in a Set.

7. Combine Results:
Combine the sets of commits and issues identified with SATD keywords from Steps 3, 4, and 6.

8. Display or Save Results:
You can choose to display the list of commits and issues with SATD keywords or save this information to a file or a database for further analysis.

9. Additional Preprocessing:
Depending on the quality of your results, you may need to perform additional preprocessing, such as removing false positives or refining the set of SATD keywords.
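
As a rough illustration of steps 2, 6, and 7, the sketch below scans a few made-up texts for keywords and combines the matching identifiers with a set union. The sample data and the helper name `ids_with_keywords` are hypothetical; they only stand in for real commit messages and issue fields.

```python
def ids_with_keywords(texts_by_id, keywords):
    """Return the ids whose text contains at least one keyword (plain substring match)."""
    return {
        item_id
        for item_id, text in texts_by_id.items()
        if any(keyword in text for keyword in keywords)
    }


# Hypothetical sample data standing in for commit messages and issue summaries
keywords = {"TODO", "FIXME", "workaround"}
commit_messages = {"abc123": "Add workaround for flaky startup", "def456": "Bump version"}
issue_summaries = {"PROJ-1": "FIXME left in the parser", "PROJ-2": "Release notes"}

satd_commit_hashes = ids_with_keywords(commit_messages, keywords)
satd_issue_keys = ids_with_keywords(issue_summaries, keywords)

# Step 7: combine both result sets
all_satd = satd_commit_hashes | satd_issue_keys
print(sorted(all_satd))
```

Sets keep the combined result free of duplicates when the same identifier is found by more than one check.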



# 1. Install the dependencies

In [None]:
print('Install Pydriller.')
!pip install pydriller > install_pydriller.log
print('Install gitpython.')
!pip3 install gitpython > install_gitpython.log
print('Install Jira Python lib.')
!pip install jira > install_jira_python.log
!sudo apt install -y sqlite3 > install_sqlite.log
print('All dependencies installed!')
!cat install_*.log > install.log
print('Details in install.log')

Install Pydriller.
Install gitpython.
Install Jira Python lib.


All dependencies installed!
Details in install.log


# 2. Import dependencies, configurations and supporting classes

In [None]:
import os
import re
import sqlite3
import subprocess
from datetime import datetime, timedelta

import tqdm
from jira import JIRA
from pydriller import Repository

In [None]:
url_to_repository = 'https://github.com/apache/cassandra.git'
path_to_repository = 'cassandra'
os.environ['MY_REPOSITORY'] = url_to_repository

JIRA_SERVER = 'https://issues.apache.org/jira'
DATABASE_NAME = "issues_db.db"

os.environ['DATABASE_NAME'] = DATABASE_NAME
# Credentials
os.environ['USERNAME'] = '?'
os.environ['PASSWORD'] = '?'
username = os.environ.get('USERNAME')
password = os.environ.get('PASSWORD')

In [None]:
class SATDCommitAnalyzer:
    def __init__(self, path_to_repository):
        self.path_to_repository = path_to_repository

    @staticmethod
    def is_java_comment(line):
        # Regular expression pattern to match Java comments
        comment_pattern = r'^\s*//|^\s*/\*|^\s*\*|^\s*\*/'

        # Use the re.match function to check if the line matches the comment pattern
        return bool(re.match(comment_pattern, line))

    def analyze_commits_for_satd(self, start_date, end_date, satd_keywords):
        # Initialize sets to store commits and their associated SATD keywords
        commits_with_satd = set()
        dict_commit_msg = {}

        # Traverse commits within the specified date range
        print('Please wait...')
        my_traverser_commits = Repository(self.path_to_repository, since=start_date, to=end_date).traverse_commits()
        total_commits = len(list(my_traverser_commits))

        for commit in tqdm.tqdm(Repository(self.path_to_repository, since=start_date, to=end_date).traverse_commits(), total=total_commits, desc="Progress commit analysis"):
            for keyword in satd_keywords:
                if keyword in commit.msg:
                    commits_with_satd.add(commit.hash)
                    dict_commit_msg[commit.hash] = commit.msg

        return commits_with_satd, dict_commit_msg

    def analyze_commit_diffs_for_satd(self, start_date, end_date, satd_keywords):
        # Initialize a dictionary to store commit hashes and their associated SATD keywords and diff content
        dict_commit_diffs = {}

        # Traverse commits within the specified date range
        print('Please wait...')
        my_traverser_commits = Repository(self.path_to_repository, since=start_date, to=end_date).traverse_commits()
        total_commits = len(list(my_traverser_commits))

        for commit in tqdm.tqdm(Repository(self.path_to_repository, since=start_date, to=end_date).traverse_commits(), total=total_commits, desc="Progress commit analysis"):
            list_keywords_by_commit = []
            list_diff_content_by_commit = []

            for modification in commit.modified_files:
                for line in modification.diff_parsed['added']:
                    valor_linha = line[0]
                    conteudo_linha = line[1]

                    for keyword in satd_keywords:
                        if SATDCommitAnalyzer.is_java_comment(conteudo_linha):
                            if keyword in conteudo_linha:
                                list_keywords_by_commit.append(keyword)
                                list_diff_content_by_commit.append(conteudo_linha)

            if list_keywords_by_commit:
                elemento = list_keywords_by_commit, list_diff_content_by_commit
                dict_commit_diffs[commit.hash] = elemento

        return dict_commit_diffs
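
The diff analysis relies on PyDriller's `diff_parsed['added']`, a list of `(line_number, content)` tuples. A minimal sketch of the same filtering on hand-written tuples, without a repository, might look like this. The sample lines and the function name `satd_comment_lines` are made up for illustration, and the comment check is hoisted out of the keyword loop so it runs once per line.

```python
import re

# Equivalent to the pattern in is_java_comment: lines starting with //, /*, * or */
COMMENT_PATTERN = re.compile(r'^\s*(//|/\*|\*)')


def satd_comment_lines(added_lines, keywords):
    """Return (line_number, keyword, content) for comment lines containing a keyword."""
    matches = []
    for line_number, content in added_lines:
        if not COMMENT_PATTERN.match(content):
            continue  # skip non-comment code lines
        for keyword in sorted(keywords):
            if keyword in content:
                matches.append((line_number, keyword, content))
    return matches


# Hypothetical added lines in the shape of diff_parsed['added']
added = [
    (10, "    // TODO: remove this hack once the cache is fixed"),
    (11, "    int hack = 1;"),
    (12, "     * FIXME handle empty input"),
]
for match in satd_comment_lines(added, {"TODO", "FIXME", "hack"}):
    print(match)
```

Line 11 is ignored even though it contains a keyword, because it is code rather than a comment.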

In [None]:
class JiraIssue:
  def __init__(self, key, summary, issue_type, status, priority, description, comments):
    self.key = key
    self.summary = summary
    self.issue_type = issue_type
    self.status = status
    self.priority = priority
    self.description = description
    self.comments = comments

  def get_comments(self) -> list:
    return self.comments

  def __str__(self):
    return (f'Key: {self.key}, Summary: {self.summary}, Type: {self.issue_type}, Status: {self.status}')

class JiraIssues:
  def __init__(self,project, issues):
    self.project = project
    self.issues = issues

  def add_issue(self, issue):
    self.issues.append(issue)

  def get_issues(self) -> list:
    return self.issues

  def update_issues(self, issues):
    self.issues = issues

  def __str__(self):
    str_issues = ""
    for issue in self.get_issues():
      str_issues = str_issues + str(issue)
      str_issues = str_issues + ', '
    str_issues = '[' + str_issues + ']'
    return (f'Project: {self.project}, Number of issues: {len(self.issues)}, Issues: {str_issues}')

# Utility class for working with the Jira server
class JiraUtils:
  def __init__(self, project, jira_instance):
    self.project = project
    self.jira_jira_instance = jira_instance

  def generate_intervals_between_dates(self, date1: tuple, date2: tuple, distance=120) -> list:
    start_date = datetime(date1[0], date1[1], date1[2])
    end_date = datetime(date2[0], date2[1], date2[2])
    interval_days = distance
    # Initialize a list to store the intervals
    intervals = []
    # Initialize the current date as the start date
    current_date = start_date
    # Generate fixed-size intervals while the current date is before the end date;
    # the last interval may extend past end_date
    while current_date < end_date:
        interval = (current_date, current_date + timedelta(days=interval_days - 1))
        intervals.append(interval)
        current_date += timedelta(days=interval_days)
    return intervals

  def convert_interval_dates(self, dates: list) -> list:
    list_interval_dates = []
    for each in dates:
      date1 = each[0]
      # Convert the date to a string in the format "YYYY/MM/DD".
      str_date1 = date1.strftime("%Y/%m/%d")
      date2 = each[1]
      str_date2 = date2.strftime("%Y/%m/%d")
      elemento = str_date1, str_date2
      list_interval_dates.append(elemento)
    return list_interval_dates

  def generate_list_of_sentences(self, dates: list) -> list:
    lista_sentencas = []
    for each in dates:
      str_date1 = each[0].strftime("%Y/%m/%d")
      str_date2 = each[1].strftime("%Y/%m/%d")
      sentenca = f'project={self.project.upper()} and created>="{str_date1}" and created<="{str_date2}"'
      lista_sentencas.append(sentenca)
    return lista_sentencas

  def get_list_of_block_issues_by_dates(self,date1, date2, distance=120) -> list:
    print('Please wait...')
    t1 = datetime.now()
    list_of_dates = self.generate_intervals_between_dates(date1,date2,distance)
    lista_sentencas = self.generate_list_of_sentences(list_of_dates)
    lista_bloco_issues_by_date = []
    # Run one JQL query per date interval and show progress with tqdm
    iterable_lista_sentencas = tqdm.tqdm(lista_sentencas, total=len(lista_sentencas), desc="Progress Message Analysis")
    for each in iterable_lista_sentencas:
      issues_by_date_temp = self.jira_jira_instance.search_issues(each,maxResults=1000)
      print(f'Range: {each}, issue count: {len(issues_by_date_temp)}')
      lista_bloco_issues_by_date.append(issues_by_date_temp)
    t2 = datetime.now()
    print(t2)
    print(f'Query time: {t2-t1}')
    return lista_bloco_issues_by_date

  def concatenate_block_of_issues(self,block_of_issues):
    concatenated_list = [item for sublist in block_of_issues for item in sublist]
    print(f'Total issues retrieved: {len(concatenated_list)}')
    return concatenated_list

class IssuesDatabase:
    def __init__(self, database_name):
        self.database_name = database_name
        self.create_tables()

    def create_tables(self):
        self.conn = sqlite3.connect(self.database_name)
        self.cursor = self.conn.cursor()

        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS Issues (
                id INTEGER PRIMARY KEY,
                project TEXT,
                key TEXT,
                summary TEXT,
                issue_type TEXT,
                status TEXT,
                priority TEXT,
                description TEXT
            )
        ''')

        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS Comments (
                id INTEGER PRIMARY KEY,
                key TEXT,
                comment TEXT
            )
        ''')

        self.conn.commit()

    def insert_in_table_issues(self, project, key, summary, issue_type, status, priority, description):
        values = (None, project, key, summary, issue_type, status, priority, description)
        self.cursor.execute('''
            INSERT INTO Issues
            (id, project, key, summary, issue_type, status, priority, description)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        ''', values)

        self.conn.commit()

    def insert_in_table_comments(self, key, comment):
        values = (None, key, comment)
        self.cursor.execute('''
            INSERT INTO Comments
            (id, key, comment)
            VALUES (?, ?, ?)
        ''', values)

        self.conn.commit()

    def show_content(self, table):
        query = f"SELECT * FROM {table}"
        self.cursor.execute(query)

        rows = self.cursor.fetchall()
        for row in rows:
            print(row)

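    def show_n_lines(self, table, n):
        # Table names cannot be bound as SQL parameters, so pass only trusted names here
        query = f"SELECT * FROM {table} LIMIT ?"
        self.cursor.execute(query, (n,))

        # LIMIT ensures that at most n rows are fetched and printed
        for row in self.cursor.fetchall():
            print(row)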

    def close_connection(self):
        self.conn.close()

def analyze_jira_issues_for_satd(project, satd_keywords, all_issues):
    # Create an instance of JiraIssues to manage SATD issues
    satd_issues = JiraIssues(project, [])

    total_items = len(all_issues)

    # Iterate through the fetched issues
    for issue in tqdm.tqdm(all_issues, total=total_items, desc='Progress jira issues analysis'):
        issue_key = issue.key
        # Missing summary or description fields come back as None
        issue_summary = issue.fields.summary or ""
        issue_description = issue.fields.description or ""
        issue_comments = [comment.body or "" for comment in issue.fields.comment.comments]

        # Check for SATD keywords in the issue's summary, description, and comments.
        # Joining with newlines keeps a keyword from matching across two fields.
        searchable_text = '\n'.join([issue_summary, issue_description, *issue_comments])
        is_satd = any(keyword in searchable_text for keyword in satd_keywords)

        if is_satd:
            issue_type = issue.fields.issuetype.name
            issue_status = issue.fields.status.name
            # Priority can be unset on some issues
            issue_priority = issue.fields.priority.name if issue.fields.priority else None

            # Create a JiraIssue instance for the SATD issue
            satd_issue = JiraIssue(issue_key, issue_summary, issue_type, issue_status, issue_priority, issue_description, issue_comments)

            # Add the SATD issue to the list of SATD issues
            satd_issues.add_issue(satd_issue)

    return satd_issues
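
For reference, the `IssuesDatabase` pattern boils down to parameterized `sqlite3` statements. The sketch below runs against an in-memory database so it leaves no file behind. The table is a trimmed-down, hypothetical version of `Issues`, and the rows are invented.

```python
import sqlite3

# In-memory database; nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Issues (id INTEGER PRIMARY KEY, key TEXT, summary TEXT)")

# Invented rows; the ? placeholders let sqlite3 handle quoting
rows = [
    ("PROJ-1", "Temporary workaround in parser"),
    ("PROJ-2", "TODO: add tests"),
    ("PROJ-3", "Remove hack"),
]
conn.executemany("INSERT INTO Issues (key, summary) VALUES (?, ?)", rows)
conn.commit()

# LIMIT keeps the preview to at most n rows
n = 2
preview = conn.execute(
    "SELECT key, summary FROM Issues ORDER BY id LIMIT ?", (n,)
).fetchall()
for row in preview:
    print(row)

conn.close()
```

Omitting `id` from the insert lets SQLite assign the primary key, which has the same effect as passing `None` for it.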


# 3. Clone the repository

In [None]:
print(f'Cloning repository: {url_to_repository}')
!git clone $MY_REPOSITORY

Cloning repository: https://github.com/apache/cassandra.git
Cloning into 'cassandra'...
remote: Enumerating objects: 400963, done.
remote: Counting objects: 100% (4054/4054), done.
remote: Compressing objects: 100% (1567/1567), done.
remote: Total 400963 (delta 1871), reused 3622 (delta 1757), pack-reused 396909
Receiving objects: 100% (400963/400963), 398.31 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (233498/233498), done.
Updating files: 100% (6130/6130), done.


# 4. Analyze commits

In [None]:
satd_keywords = {
    "typo", "unused import", "error message", "comment", "logging", "javadoc", "minor", "update",
    "debug", "code cleanup", "formatting", "more tests", "documentation", "work in progress",
    "improvement", "rename", "support for", "header", "interface", "annotation", "naming", "class",
    "tidy up", "files", "extension point", "exception", "handling", "test", "output", "cast",
    "simplify", "findbugs", "leak", "implementation", "unused code", "API", "refactoring",
    "checkstyle errors", "redundant", "deprecated code", "constructor", "endpoints", "flaky",
    "unused", "unnecessary", "confusing", "ugly", "too much", "not used", "more readable",
    "more efficient", "dead code", "infinite loop", "too long", "not implemented", "less verbose",
    "more robust", "speed up", "get rid of", "not thread safe", "clean up code", "not done yet",
    "avoid extra seek", "reduce duplicate code", "no longer needed", "not supported yet",
    "documentation doesn't match", "short term solution", "spurious error messages", "it'd be nice",
    "please add a test", "would significantly improve", "performance", "makes it much easier",
    "avoid calling it twice", "takes a long time", "good to have coverage",
    "makes it very hard", "patch doesn't apply cleanly", "it's not perfectly documented",
    "need to update documentation", "make it less brittle", "documentation does not mention",
    "wastes a lot of space", "there is no unit test", "lead to huge memory allocation",
    "test doesn't add much value", "some holes in the doc", "by hard coding instead of",
    "should be updated to reflect", "more tightly coupled than ideal", "any chance of a test",
    "should improve a bit by", "solution won't be really satisfactory", "misleading"
}

satd_keywords_temp = {'TODO', 'FIXME', 'refactor', 'hack', 'workaround', 'technical debt', 'cleanup', 'clean', 'fix'}

my_satd_keywords = satd_keywords_temp.union(satd_keywords)
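
Short keywords such as 'fix', 'test', or 'clean' also match inside longer words when compared with `in`, which can inflate the results. One possible refinement, shown here as a sketch rather than the notebook's method, is case-insensitive matching on word boundaries with `re`. The helper name `find_keywords` and the sample sentences are hypothetical.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = set()
    for keyword in keywords:
        # re.escape keeps characters such as apostrophes literal
        pattern = r'\b' + re.escape(keyword) + r'\b'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(keyword)
    return found


sample_keywords = {"fix", "test", "TODO"}
# "latest" and "prefix" contain "test" and "fix" only as substrings
print(find_keywords("Use the latest prefix format", sample_keywords))
print(find_keywords("Fix the todo left in the test", sample_keywords))
```

Whether stricter matching is desirable depends on the study: it reduces substring false positives, but it also stops 'clean' from matching 'cleanup'.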

In [None]:
start_date = datetime(2021, 1, 1, 0, 0, 0)
end_date = datetime(2021, 12, 30, 0, 0, 0)

my_satd_commit_analyzer = SATDCommitAnalyzer(path_to_repository)
commits_with_satd, dict_commit_msg = my_satd_commit_analyzer.analyze_commits_for_satd(start_date, end_date, my_satd_keywords)

Please wait...


Progress commit analysis: 100%|██████████| 1197/1197 [00:00<00:00, 1436.43it/s]


In [None]:
print('Show a fragment of the dictionary of commits and messages')
cont = 0
for k, v in dict_commit_msg.items():
  print(f'k: {k}, v: {v}')
  if cont == 10:
    break
  cont += 1

Show a fragment of the dictionary of commits and messages
k: 457422ac1acc3c6e13c3738410def539af64d264, v: Use ubuntu2004_* docker testing images, from the apache organisation in dockerhub

 patch by Mick Semb Wever; reviewed by Sam Tunnicliffe for CASSANDRA-16373
k: f02e53568dbc193b7ac75cc19b0a7751d5514b95, v: Improve empty/corrupt hint file handling on startup

Patch by marcuse; reviewed by Benjamin Lerer and Yifan Cai for CASSANDRA-16162
k: 92182bcc23ddb51d758aed9df97f27644cd777c6, v: build.xml first test commit
k: 7214794e3f86b67d96371c78815c1ce03b3a5d6b, v: ninja fix CHANGES.txt
k: 12b610246bc42dc6af33abfe0885b2f989fc2c73, v: Don't manually remove endpoints in distributed tests

Patch by brandonwilliams, reviewed by ycai and edimitrova for
CASSANDRA-16229
k: 7f1659cd1d46ab8904eee99daefcaaa7a521e00b, v: Upgrade netty and chronicle-queue dependencies to get Auditing and native library loading working on arm64 architectures

 CASSANDRA-16384 test case AuditLoggerTest fail on aarch64 platfor

## 4.1 Analyze commit diffs

In [None]:
start_date = datetime(2021, 1, 1, 0, 0, 0)
end_date = datetime(2021, 12, 30, 0, 0, 0)

dict_commit_diffs = my_satd_commit_analyzer.analyze_commit_diffs_for_satd(start_date, end_date, my_satd_keywords)

Please wait...


Progress commit analysis: 100%|██████████| 1197/1197 [00:35<00:00, 33.67it/s]


In [None]:
cont = 0
for k,v in dict_commit_diffs.items():
  print(f'k: {k}, v: {v}')
  if cont == 10:
    break
  cont += 1

k: f02e53568dbc193b7ac75cc19b0a7751d5514b95, v: (['handling'], [' * Improve empty hint file handling during startup (CASSANDRA-16162)'])
k: 661f1aab171dc3ef16075f69581e88ad4a133fae, v: (['test'], ['        // Needed to stabilize sstable count for off-cache sized tests (e.g. count = 100_000_000)'])
k: 0a1e900a0a042f78d7d5d6625bc98b84eb463e69, v: (['update', 'update', 'update', 'update', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean', 'clean'], ['         * then also update reclaiming since the flush operation is waiting at the barrier for in-flight writes,', '         * then also update reclaiming since the flush operation is waiting at the barrier for in-flight writes,', '         * If the state is still live, then we update the memory we own here and in the parent.', '         * However, if the state is not live, we do not update it because we would have to update', '     
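
Because each matching line appends its keyword again, a commit's keyword list can repeat the same entry many times, as in the last line of output above. A small sketch with `collections.Counter` condenses such lists into counts. The dictionary below is hand-made sample data in the same `{hash: (keywords, lines)}` shape, not real output.

```python
from collections import Counter


def keyword_counts(commit_diffs):
    """Map each commit hash to a Counter of its matched keywords."""
    return {
        commit_hash: Counter(keywords)
        for commit_hash, (keywords, _lines) in commit_diffs.items()
    }


# Hand-made sample in the shape returned by analyze_commit_diffs_for_satd
sample_diffs = {
    "aaa111": (["update", "update", "clean"], ["* update a", "* update b", "* clean c"]),
    "bbb222": (["test"], ["// test d"]),
}
for commit_hash, counts in keyword_counts(sample_diffs).items():
    print(commit_hash, dict(counts))
```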

# 5. Analyze Jira issues

## 5.1 Configure the instance, project, and date range

In [None]:
# Initialize the Jira connection
print('Initialize the Jira connection')
jira = JIRA(JIRA_SERVER, basic_auth=(username, password))

# Create a JiraUtils instance
print('Create a JiraUtils instance')
jira_utils = JiraUtils('CASSANDRA', jira)

# Define date intervals
print('Define date intervals')
date1 = (2023, 1, 1)
date2 = (2023, 9, 1)
distance = 120

Initialize the Jira connection
Create a JiraUtils instance
Define date intervals


## 5.2 Retrieve all issues in the date range

In [None]:
# Fetch issues using date intervals
print('Fetch issues using date intervals')
block_of_issues = jira_utils.get_list_of_block_issues_by_dates(date1, date2, distance)

Fetch issues using date intervals
Please wait...


Progress Message Analysis:  33%|███▎      | 1/3 [00:04<00:09,  4.51s/it]

Range: project=CASSANDRA and created>="2023/01/01" and created<="2023/04/30", issue count: 360


Progress Message Analysis:  67%|██████▋   | 2/3 [00:08<00:04,  4.01s/it]

Range: project=CASSANDRA and created>="2023/05/01" and created<="2023/08/28", issue count: 309


Progress Message Analysis: 100%|██████████| 3/3 [00:09<00:00,  3.07s/it]

Range: project=CASSANDRA and created>="2023/08/29" and created<="2023/12/26", issue count: 96
2023-10-01 20:33:16.962546
Query time: 0:00:09.218923
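
The last range printed above runs to 2023/12/26 even though the requested end date is 2023/09/01, because every window is always `distance` days long. One possible variant, sketched here as a standalone function (the name `clamped_intervals` is hypothetical and not part of the notebook), caps the final window at the end date.

```python
from datetime import datetime, timedelta


def clamped_intervals(start, end, distance=120):
    """Split [start, end] into consecutive windows of at most `distance` days."""
    intervals = []
    current = start
    while current <= end:
        # Never let a window run past the requested end date
        window_end = min(current + timedelta(days=distance - 1), end)
        intervals.append((current, window_end))
        current += timedelta(days=distance)
    return intervals


for first, last in clamped_intervals(datetime(2023, 1, 1), datetime(2023, 9, 1)):
    print(first.strftime("%Y/%m/%d"), last.strftime("%Y/%m/%d"))
# 2023/01/01 2023/04/30
# 2023/05/01 2023/08/28
# 2023/08/29 2023/09/01
```

With the clamp, the generated JQL would no longer ask for issues created after the chosen end date.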





In [None]:
# Concatenate the block of issues into a single list
print('Concatenate the block of issues into a single list')
all_issues = jira_utils.concatenate_block_of_issues(block_of_issues)

Concatenate the block of issues into a single list
Total issues retrieved: 765


## 5.3 Filter SATD issues

In [None]:
project = 'CASSANDRA'

# 'all_issues' was fetched from the Jira server in section 5.2

# Call the function to analyze Jira issues for SATD
satd_issues = analyze_jira_issues_for_satd(project, my_satd_keywords, all_issues)

Progress jira issues analysis: 100%|██████████| 765/765 [00:00<00:00, 8360.30it/s]


In [None]:
for i,each in enumerate(satd_issues.get_issues()):
  print(i+1, each)

1 Key: CASSANDRA-18488, Summary: WEBSITE - Replace homepage, Events page banners with C* Summit, Type: Task, Status: Resolved
2 Key: CASSANDRA-18487, Summary: MappedByteBufferIndexInputProvider can better throw UndeclaredThrowableException, Type: Bug, Status: Resolved
3 Key: CASSANDRA-18486, Summary: LeveledCompactionStrategy does not check its constructor, Type: Bug, Status: Resolved
4 Key: CASSANDRA-18485, Summary: CEP-15: (C*) Enhance in-memory FileSystem to work with mmap and support tests to add custom logic, Type: Improvement, Status: Resolved
5 Key: CASSANDRA-18484, Summary: FunctionCall can throw more specific exceptions, Type: Bug, Status: Open
6 Key: CASSANDRA-18483, Summary: jvm dtest logs should indicate specific test case name being run, Type: Improvement, Status: Open
7 Key: CASSANDRA-18482, Summary: Test Failure: HintsDisabledTest.testHintedHandoffDisabled, Type: Improvement, Status: Resolved
8 Key: CASSANDRA-18481, Summary: Add support for Lucene index and query analyze