
# Automating Code Review Activities
---

# Introduction:

Code review is an integral part of the software development cycle: it helps ensure that code is correct, functional, and maintainable. The paper "Automating Code Review Activities by Large-Scale Pre-training" (ESEC/FSE 2022) introduces CodeReviewer, a model pre-trained on a large code review dataset with the aim of automating parts of this process.

This notebook applies the pre-trained CodeReviewer model to review comments from a chosen GitHub repository and evaluates the generated comments against the ones written by human reviewers.

# Setup:
Before we proceed, let's install all the required dependencies:

In [None]:
!pip install requests
!pip install PyGithub
!pip install transformers
!pip install torch
!pip install nltk
!pip install rouge

# Preprocessing Diff Hunks using Special Tokens

In the paper, the authors stress that preprocessing the input matters for model performance. Rather than feeding raw diff hunks to the model, they replace the leading `+`, `-`, and space characters of each diff line with special tokens that mark the line as added, deleted, or kept (`<add>`, `<del>`, and `<keep>` in the code below). According to the paper, this improves model performance.

The following function, sourced from the authors' GitHub repository, implements this preprocessing:

In [3]:
def add_special_tokens(diff_hunk: str):
    diff_lines = diff_hunk.split("\n")[1:]        # remove start @@
    diff_lines = [line for line in diff_lines if len(line.strip()) > 0]
    map_dic = {"-": 0, "+": 1, " ": 2}
    def f(s):
        if s in map_dic:
            return map_dic[s]
        else:
            return 2
    labels = [f(line[0]) for line in diff_lines]
    diff_lines = [line[1:].strip() for line in diff_lines]
    input_str = ""
    for label, line in zip(labels, diff_lines):
        if label == 1:
            input_str += "<add>" + line
        elif label == 0:
            input_str += "<del>" + line
        else:
            input_str += "<keep>" + line
    return input_str
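
As a quick check of what this preprocessing produces, the sketch below runs a condensed restatement of `add_special_tokens` on a small diff hunk. The hunk and the function `area` inside it are made up for illustration, and the condensed function is only intended to behave like the one above, not to replace it.

```python
def add_special_tokens(diff_hunk: str) -> str:
    """Condensed restatement of the preprocessing function above."""
    tokens = {"-": "<del>", "+": "<add>"}  # any other prefix is treated as <keep>
    lines = diff_hunk.split("\n")[1:]  # drop the @@ header line
    lines = [line for line in lines if line.strip()]  # drop blank lines
    return "".join(tokens.get(line[0], "<keep>") + line[1:].strip() for line in lines)


# Hypothetical diff hunk, made up for illustration
example_hunk = (
    "@@ -1,2 +1,2 @@\n"
    " def area(r):\n"
    "-    return 3.14 * r * r\n"
    "+    return math.pi * r ** 2\n"
)

result = add_special_tokens(example_hunk)
print(result)
# <keep>def area(r):<del>return 3.14 * r * r<add>return math.pi * r ** 2
```

Note that indentation is stripped from every line and the lines are joined without separators, so the special tokens are the only line boundaries the model sees.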


# Collecting Review Comments from the pandas Repository

For this analysis we need a dataset that is representative of real-world code review comments. The pandas repository is a good candidate given its stature and the quality of its code.

We focus on recent review comments so that the analysis reflects current review practices.

The data is gathered with the PyGithub library, a Python client for the GitHub API.

The code below collects up to 100 recent review comments from open pull requests in the pandas repository. It keeps only top-level comments (not replies) on Python files whose diff hunk is at most 1000 characters, and it skips duplicate diffs so that each generated comment is compared against exactly one actual comment.

In [4]:
from github import Github

TOKEN = "TOKEN"  # Replace with your token

# Initialize the Github object with the token
g = Github(TOKEN)

repo = g.get_repo("pandas-dev/pandas")

# Fetch recent pull requests
pulls = repo.get_pulls(state="open", sort="created", direction="desc")  # Fetching open PRs sorted by creation time in descending order

filtered_comments = []
processed_diffs = set()  # To keep track of the processed hunks

MAX_COMMENTS = 100  # Stop once this many comments have been collected

# Iterate through the PRs and fetch review comments
for pr in pulls:
    if len(filtered_comments) >= MAX_COMMENTS:
        break  # Stop the outer loop too; breaking only the inner loop keeps fetching PRs
    for comment in pr.get_review_comments():
        if not comment.path.endswith('.py'):  # Keep only comments on Python files
            continue
        # Keep only top-level comments (not replies) whose diff hunk is <= 1000 characters
        if comment.in_reply_to_id is not None or len(comment.diff_hunk) > 1000:
            continue

        preprocessed_patch = add_special_tokens(comment.diff_hunk)

        # Skip diffs that have been processed before
        if preprocessed_patch in processed_diffs:
            continue
        processed_diffs.add(preprocessed_patch)

        filtered_comments.append({
            "patch": preprocessed_patch,
            "msg": comment.body,
            "id": comment.id,
            "file": comment.path
        })

        if len(filtered_comments) >= MAX_COMMENTS:
            break
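
The filtering rules can be checked offline, without calling the GitHub API. The sketch below restates them as a small predicate and applies it to stand-in objects. `FakeComment`, `keep_comment`, and the sample data are hypothetical and introduced only for this illustration; just the attribute names `path`, `in_reply_to_id`, and `diff_hunk` mirror the ones used in the code above. For simplicity it deduplicates on the raw diff hunk, whereas the code above deduplicates on the preprocessed patch.

```python
from dataclasses import dataclass
from typing import Optional, Set

MAX_HUNK_CHARS = 1000  # same limit as in the collection code


@dataclass
class FakeComment:
    """Hypothetical stand-in for a review comment object."""
    path: str
    diff_hunk: str
    in_reply_to_id: Optional[int] = None


def keep_comment(comment: FakeComment, seen_hunks: Set[str]) -> bool:
    """Return True if the comment passes all filters, recording its hunk as seen."""
    if not comment.path.endswith(".py"):
        return False  # not a Python file
    if comment.in_reply_to_id is not None:
        return False  # a reply, not a top-level comment
    if len(comment.diff_hunk) > MAX_HUNK_CHARS:
        return False  # diff hunk too long
    if comment.diff_hunk in seen_hunks:
        return False  # duplicate diff
    seen_hunks.add(comment.diff_hunk)
    return True


# Made-up sample data covering each filter
comments = [
    FakeComment("pandas/core/frame.py", "@@ -1 +1 @@\n-a\n+b"),
    FakeComment("doc/source/index.rst", "@@ -1 +1 @@\n-c\n+d"),        # wrong file type
    FakeComment("pandas/core/frame.py", "@@ -1 +1 @@\n-a\n+b"),        # duplicate diff
    FakeComment("pandas/io/json.py", "@@ -2 +2 @@\n-e\n+f", 12345),    # reply
    FakeComment("pandas/io/json.py", "@@ -3 +3 @@\n" + "+x\n" * 500),  # too long
]
seen: Set[str] = set()
kept = [c for c in comments if keep_comment(c, seen)]
print([c.path for c in kept])
```

Only the first comment passes every filter; each of the others is rejected by exactly one rule.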


In this notebook, the pre-trained CodeReviewer model (`microsoft/codereviewer`) is used to generate review comments for the preprocessed diffs obtained from the pandas repository.


In [None]:
from transformers import pipeline

# Load the model and perform inference
pipe = pipeline("text2text-generation", model="microsoft/codereviewer", max_length=200)

preprocessed_diffs = [item["patch"] for item in filtered_comments]
generated_comments = pipe(preprocessed_diffs)

# Store each prediction alongside its target comment
predictions_and_targets = [
    {
        "pred": output["generated_text"],
        "target": item["msg"],
        "id": item["id"],
    }
    for item, output in zip(filtered_comments, generated_comments)
]


Display the results


In [12]:
for data in predictions_and_targets[90:95]:
    original_diff = [item["patch"] for item in filtered_comments if item["id"] == data["id"]][0]
    print(f"Diff Code (Preprocessed):\n{original_diff}\n")
    print(f"Original Comment: {data['target']}\n")
    print(f"Generated Comment: {data['pred']}\n")
    print("< ================ >")

Diff Code (Preprocessed):
<keep>def make_na_array(dtype: DtypeObj, shape: Shape, fill_value) -> ArrayLike:<keep>if isinstance(dtype, DatetimeTZDtype):<keep># NB: exclude e.g. pyarrow[dt64tz] dtypes<del>i8values = np.full(shape, fill_value._value)<add>i8values = np.full(shape, Timestamp(fill_value)._value)

Original Comment: we need to be sure that the Timestamp here has the same unit as the dtype

Generated Comment: <msg>This is the only place where we need to convert a `Timestamp` to a `DatetimeTZDtype`.

Diff Code (Preprocessed):
<keep>os.chdir(dirname)<keep>self._run_os("zip", zip_fname, "-r", "-q", *fnames)<add>def _linkcheck(self):<add>"""<add>Check for broken links in the documentation.<add>"""<add>cmd = ["sphinx-build", "-b", "linkcheck"]<add>if self.num_jobs:<add>cmd += ["-j", self.num_jobs]<add>if self.verbosity:<add>cmd.append(f"-{'v' * self.verbosity}")<add>cmd += [<add>"-d",<add>os.path.join(BUILD_PATH, "doctrees"),<add>SOURCE_PATH,<add>os.path.join(BUILD_PATH, "linkcheck")
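
The first generated comment above starts with a literal `<msg>` marker, which never occurs in the human-written comments. With whitespace tokenization, `<msg>This` becomes a single token that cannot match anything in the reference. One option, sketched below as an assumption rather than something this notebook does, is to strip the marker before scoring. `strip_msg_marker` is a hypothetical helper name.

```python
MSG_MARKER = "<msg>"


def strip_msg_marker(text: str) -> str:
    """Remove a leading <msg> marker and surrounding whitespace, if present."""
    text = text.strip()
    if text.startswith(MSG_MARKER):
        text = text[len(MSG_MARKER):]
    return text.strip()


generated = "<msg>This is the only place where we need to convert a `Timestamp` to a `DatetimeTZDtype`."
cleaned = strip_msg_marker(generated)
print(cleaned)
```

Text without the marker is returned unchanged apart from trimmed whitespace, so the helper can be applied to every prediction.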

# Qualitative Analysis Summary:

* **Relevance to Original Comment**:

  The model's generated comments sometimes align with the intent of the original comments but might express the idea differently.
  There are instances where the model's focus deviates from the core concern of the original comment.

* **Tendency for Specific Code Suggestions**:

  The model frequently provides specific code suggestions or fixes, even when the original comment might be addressing broader concerns or general observations.

* **Accuracy Categories**:

  Accurate: The model's comment aligns well with the intent of the original comment.

  Partially Accurate: The model captures some essence but misses certain key aspects.

  Inaccurate: The model's comment deviates significantly from the original or focuses on a different aspect of the code.

* **General Observation**:

  The model appears to understand code context to some extent, but there's room for improvement in aligning its responses with human reviewers' insights.

# Quantitative Analysis

For a quantitative measure of how well the generated comments match with the original ones, we employ the BLEU score metric.

Originally devised for evaluating machine translation, BLEU is also widely applied to other text generation tasks, including our case of generated review comments.

In [14]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Assuming predictions_and_targets is your data
original_comments = [item['target'] for item in predictions_and_targets]
generated_comments = [item['pred'] for item in predictions_and_targets]

# Compute BLEU scores
bleu_scores = []

smoother = SmoothingFunction()

for original, generated in zip(original_comments, generated_comments):
    reference = [original.split()]  # the reference and candidate should be tokenized
    candidate = generated.split()
    score = sentence_bleu(reference, candidate, smoothing_function=smoother.method1)  # Using smoothing since BLEU can be sensitive to short texts
    bleu_scores.append(score)

average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f'Average BLEU score for the generated comments: {average_bleu:.4f}')


Average BLEU score for the generated comments: 0.0092


# BLEU Metric Analysis

Our obtained average BLEU score of 0.0092 suggests a low degree of word overlap between the generated and original comments. While BLEU is efficient in capturing exact word matches, it doesn't necessarily reflect the semantic accuracy or the quality of the content in relation to the context.

Given the nuances of our task and the inherent limitations of BLEU, it's crucial to employ another evaluation metric to gain a more comprehensive insight into the performance of the CodeReviewer model. This leads us to the ROUGE metric.

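ROUGE reports recall as well as precision for unigram (ROUGE-1) and bigram (ROUGE-2) overlap, and ROUGE-L scores the longest common subsequence, which rewards words that appear in the same order even when they are not adjacent. Like BLEU, it measures surface overlap rather than meaning, but it gives a broader view of how closely the generated review comments match the original ones.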

In [33]:
from rouge import Rouge

# Keep only pairs where both comments are non-empty; filtering each list
# separately could misalign predictions and targets
pairs = [
    (item["pred"].strip(), item["target"].strip())
    for item in predictions_and_targets
    if item["pred"].strip() and item["target"].strip()
]

# Trim to at most the first 100 pairs
pairs = pairs[:100]
generated_comments_list = [pred for pred, _ in pairs]
original_comments = [target for _, target in pairs]

# Now compute the ROUGE scores
rouge = Rouge()
scores = rouge.get_scores(generated_comments_list, original_comments, avg=True)

# Printing the results
print(f"ROUGE-1: {scores['rouge-1']}")
print(f"ROUGE-2: {scores['rouge-2']}")
print(f"ROUGE-L: {scores['rouge-l']}")




ROUGE-1: {'r': 0.05995802890457184, 'p': 0.10664973101611024, 'f': 0.0683260733614867}
ROUGE-2: {'r': 0.0024125541125541126, 'p': 0.008469696969696969, 'f': 0.0036003641603207935}
ROUGE-L: {'r': 0.057734406122125534, 'p': 0.10168941355579281, 'f': 0.06537170546775711}


# ROUGE-1 (Unigrams):

Recall (r): On average, about 5.996% of the words in a reference comment also appear in the generated comment.

Precision (p): On average, about 10.665% of the words in a generated comment also appear in the reference comment.

F1-score (f): The harmonic mean of precision and recall, computed per comment pair and then averaged, is about 6.833%.

# ROUGE-2 (Bigrams):

Recall (r): On average, about 0.241% of the bigrams in a reference comment also appear in the generated comment.

Precision (p): On average, about 0.847% of the bigrams in a generated comment also appear in the reference comment.

F1-score (f): The harmonic mean of precision and recall, computed per comment pair and then averaged, is about 0.360%.

# ROUGE-L (Longest Common Subsequence):

Recall (r): On average, the longest common subsequence covers about 5.773% of a reference comment.

Precision (p): On average, the longest common subsequence covers about 10.169% of a generated comment.

F1-score (f): The harmonic mean of precision and recall, computed per comment pair and then averaged, is about 6.537%.
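
The ROUGE-L definitions above can be illustrated with a small from-scratch sketch. It is not the `rouge` package's implementation, whose tokenization and LCS variant may give different numbers; the two example sentences and the function names are made up for illustration.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> Dict[str, float]:
    """LCS-based recall, precision, and F1 for one reference/candidate pair."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"r": recall, "p": precision, "f": f1}


# Hypothetical comment pair
scores_example = rouge_l("please add a test for this", "add a unit test")
print(scores_example)
```

The longest common subsequence here is "add a test" (3 tokens), giving a recall of 3/6, a precision of 3/4, and an F1-score of 0.6. The word "unit" interrupts the match without breaking it, which is what distinguishes ROUGE-L from bigram overlap.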

# Project Summary: Code Review Comment Generation and Analysis
---
# Objective:
 * Automate the generation of code review comments using machine learning.

 * Evaluate the quality of the generated comments against actual review comments from the pandas GitHub repository.

# Data Collection:
 * Used the PyGithub library to collect up to 100 recent review comments from open pull requests in the pandas-dev/pandas repository.
 * Kept only top-level comments on Python files, with unique diffs not exceeding 1000 characters.

# Model for Comment Generation:
* Used the microsoft/codereviewer model, loaded through the transformers library, to predict comments from the preprocessed diff patches.
* Generated comments for the selected diffs.

# Qualitative Analysis:
 * Compared original and generated comments for a few sample diffs.
 * Observations:
      * Some generated comments were relevant and captured the essence of the original comments, but there were also discrepancies.
      * In certain cases, the generated comments diverged from the context or missed certain nuances.

# Quantitative Analysis:

* BLEU Score:
  * Used to evaluate the similarity of the generated comments with the original ones.
  * Average BLEU score: 0.0092, which indicates a low overlap between the generated comments and the original comments.

* ROUGE Score:
  * Used to measure unigram, bigram, and longest-common-subsequence overlap, with recall as well as precision.
  * Results:
    * ROUGE-1: r: 5.996%, p: 10.665%, f: 6.833%
    * ROUGE-2: r: 0.241%, p: 0.847%, f: 0.360%
    * ROUGE-L: r: 5.773%, p: 10.169%, f: 6.537%
  * The ROUGE scores further highlight the minimal overlap between generated and reference comments, particularly in terms of phrasing and structure.

# Conclusion:
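 * The CodeReviewer model showed some potential in the qualitative samples, where some generated comments were relevant to the code change.
 * However, the low BLEU and ROUGE scores show that its comments rarely match the wording of human reviewers, so there is considerable room for improvement.
 * Further refinement, more extensive training, or additional context might improve the model's performance.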


