### Data Preprocessing

As an initial step, we preprocess both the training and test data by:
1. Removing review comments that are highly similar but carry different class labels (**train data only**).
2. Removing review comments that contain keywords from all three classes (`insert`, `delete`, `replace`) in **both train and test data**.
3. Removing vague review comments in **both train and test data** that do not contain any meaningful context for the model to learn from.

We acknowledge that applying preprocessing filters to test data is sometimes referred to as **data leakage** or **test set contamination** and is not considered best practice. However, in the context of code review comments, these filters are necessary so that potentially ambiguous review comments can be set aside for manual review by a senior developer.


#### Step 1: Importing Libraries and Defining a Custom Dictionary

##### Required Libraries

In [1]:
# Import necessary libraries
import json
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from textblob import TextBlob

##### Keyword Dictionary

In [2]:
dictionary = {
    'delete': [
        r'\b(remove|removed|delete|deleted|drop|dropped|eliminate|eliminated)\b',
        r'\b(unnecessary|redundant)\b',
        r'\b(get\s+rid\s+of|got\s+rid\s+of)\b',
        r'\b(clean\s+up|cleaned\s+up)\b',
        r'\b(nuke|kill|killed)\b',
        r'\b(wrong|extra|space|trailing|bad|long)\b'
    ],
    'replace': [
        r'\b(change|changed|replace|replaced|rename|renamed|convert|converted)\b',
        r'better\s+to\s+use',
        r'\b(misplace|misplaced|misuse|misused)\b',
        r'\b(lowercase|uppercase|spelling|punctuation|punctuations)\b'
    ],
    'insert': [
        r'\b(add|added|insert|inserted|include|included)\b',
        r'\b(missing|need|needed)\b',
        r'needs?\s+to\s+be\s+added',
        r'put\s+in',
        r'short',
        r'tab',
        r'new'
    ]
}
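Several of the shorter patterns above (`short`, `tab`, `new`) are written without `\b` word boundaries, so they also match inside longer words. The sketch below only illustrates that difference on made-up comments; the bounded variant is a hypothetical alternative, and whether the broader matching is intended is not stated here.

```python
import re

UNBOUNDED_TAB = r'tab'      # as written in the dictionary above
BOUNDED_TAB = r'\btab\b'    # hypothetical stricter variant, for comparison only
UNBOUNDED_NEW = r'new'      # as written in the dictionary above

def matches(pattern, text):
    """Return True if the regex pattern is found anywhere in the lowercased text."""
    return bool(re.search(pattern, text.lower()))

# Made-up comments, for illustration only.
samples = ["Use a tab here", "Update the table name", "Renew the token"]

for text in samples:
    print(
        repr(text),
        "tab:", matches(UNBOUNDED_TAB, text),
        "bounded tab:", matches(BOUNDED_TAB, text),
        "new:", matches(UNBOUNDED_NEW, text),
    )
```

With the unbounded patterns, "table" counts as a `tab` hit and "renew" as a `new` hit, which in turn affects how often a comment is counted as containing an `insert` keyword.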

#### Step 2: Filtering Functions

##### Step 2.1 Removing Similar Review Comments with Different Labels 


This filter function has two stages:
1. Finding exact duplicates (after lowercasing and trimming whitespace) that carry different labels.
2. Finding pairs of review comments whose combined similarity, a **7:3** weighted sum of word-level and character-level TF-IDF cosine similarity, exceeds `similarity_threshold` (default `0.6`).
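The two stages can be sketched on toy data using only the standard library. This is a simplified illustration, not the function used in this notebook: it swaps TF-IDF for raw token and character n-gram counts, and the helper names (`conflicting_duplicates`, `combined_similarity`) and sample rows are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def conflicting_duplicates(rows):
    """Stage 1: normalized comments that appear with more than one label."""
    labels = defaultdict(set)
    for comment, label in rows:
        labels[comment.lower().strip()].add(label)
    return sorted(c for c, seen in labels.items() if len(seen) > 1)

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of length n_min..n_max."""
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )

def combined_similarity(a, b, word_weight=0.7):
    """Stage 2: weighted sum of word-level and character-level cosine similarity."""
    a, b = a.lower().strip(), b.lower().strip()
    word_sim = cosine(Counter(a.split()), Counter(b.split()))
    char_sim = cosine(char_ngrams(a), char_ngrams(b))
    return word_weight * word_sim + (1 - word_weight) * char_sim

# Made-up rows: (review comment, label)
rows = [
    ("Remove this line", "delete"),
    ("remove this line ", "replace"),
    ("Add a null check", "insert"),
]
print(conflicting_duplicates(rows))
print(round(combined_similarity("remove this line", "remove this line"), 3))
print(round(combined_similarity("remove this line", "add a null check"), 3))
```

A pair is treated as ambiguous only when its combined score exceeds the threshold and the two labels differ, so near-identical comments with the same label are kept.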

In [3]:
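def similar_comments_w_diff_label(df, similarity_threshold=0.6):
    """Find similar comments with different class labels.

    Returns a list of normalized ambiguous comments and a list of
    ambiguous pairs sorted by similarity score.
    """
    # Work on a copy so the helper column is not added to the caller's DataFrame
    df = df.copy()
    print("Normalizing comments...")
    df['comment_normalized'] = df['Review Comment'].str.lower().str.strip()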

    # Find exact duplicates with different labels
    print("Finding exact duplicates...")
    duplicates = df.groupby('comment_normalized')['Expected Operation by Developer'].agg(set)
    ambiguous_exact = duplicates[duplicates.apply(len) > 1].index.tolist()

    # Find similar comments using character-level similarities
    print("Calculating character-level similarities...")
    char_vectorizer = TfidfVectorizer(
        analyzer='char',
        ngram_range=(2, 4),
        strip_accents='unicode'
    )
    char_tfidf = char_vectorizer.fit_transform(df['comment_normalized'])
    char_similarities = cosine_similarity(char_tfidf)

    # Find similar comments using word-level similarities
    print("Calculating word-level similarities...")
    word_vectorizer = TfidfVectorizer(
        strip_accents='unicode',
        ngram_range=(1, 3),
        max_features=10000
    )
    word_tfidf = word_vectorizer.fit_transform(df['comment_normalized'])
    word_similarities = cosine_similarity(word_tfidf)

    # Combine similarities with more weight on word similarities
    similarities = (0.7 * word_similarities + 0.3 * char_similarities)

    ambiguous_pairs = []
    n_samples = len(df)
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            if similarities[i, j] > similarity_threshold:
                if df.iloc[i]['Expected Operation by Developer'] != df.iloc[j]['Expected Operation by Developer']:
                    ambiguous_pairs.append({
                        'comment1': df.iloc[i]['Review Comment'],
                        'label1': df.iloc[i]['Expected Operation by Developer'],
                        'comment2': df.iloc[j]['Review Comment'],
                        'label2': df.iloc[j]['Expected Operation by Developer'],
                        'similarity_score': similarities[i, j]
                    })

    # Sort pairs by similarity score
    ambiguous_pairs.sort(key=lambda x: x['similarity_score'], reverse=True)

    # Get all ambiguous comments
    ambiguous_comments = set(ambiguous_exact)  # Start with exact duplicates
    for pair in ambiguous_pairs:
        ambiguous_comments.add(pair['comment1'].lower().strip())
        ambiguous_comments.add(pair['comment2'].lower().strip())

    return list(ambiguous_comments), ambiguous_pairs

##### Step 2.2 Removing Review Comments with Keywords from All Three Class Labels


For each comment, we look for the keywords defined in the dictionary. Every class label with at least one matching keyword pattern counts as one strong indicator, so `strong_indicators` ranges from 0 to 3. A value of 3 means the comment contains keywords from all three classes, which strongly suggests that it is confusing, so we remove it.

In [4]:
def has_all_keywords(comment):
    """Return True if the comment matches keyword patterns from all three class labels."""
    comment = str(comment).lower()
    matches = {}
    for op, patterns in dictionary.items():
        matches[op] = sum(bool(re.search(pattern, comment)) for pattern in patterns)
    strong_indicators = sum(count > 0 for count in matches.values())
    return strong_indicators > 2
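As a rough illustration of the check, the sketch below runs the same idea against a reduced keyword set. The `toy_dictionary`, the helper `classes_matched`, and the sample comments are assumptions made for this example; the notebook itself uses the full dictionary from Step 1.

```python
import re

# Reduced, hypothetical keyword set: one pattern per class.
toy_dictionary = {
    'delete': [r'\b(remove|delete)\b'],
    'replace': [r'\b(change|rename)\b'],
    'insert': [r'\b(add|insert)\b'],
}

def classes_matched(comment, keyword_dict):
    """Return the set of class labels with at least one matching pattern."""
    comment = str(comment).lower()
    return {
        op for op, patterns in keyword_dict.items()
        if any(re.search(p, comment) for p in patterns)
    }

mixed = "Remove the old flag, rename the helper and add a test"
clear = "Please add a docstring"

for text in (mixed, clear):
    hit = classes_matched(text, toy_dictionary)
    # A comment is flagged only when all three classes are represented.
    print(repr(text), sorted(hit), len(hit) > 2)
```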

##### Step 2.3 Removing Vague or Contextless Review Comments


This filter function performs two checks:
1. The review comment is short (`len(words) < 4`) and matches an open-ended or referential pattern (for example, it starts with `why`, `this`, or `same as`).
2. The review comment is very short (`len(words) < 3`) and does not contain a keyword for any of the class labels defined in the dictionary above.

We remove the comment if it triggers either of these two checks.

In [5]:
def is_vague_or_contextless(comment):
    """Check for vague or contextless comments."""
    comment = str(comment).lower()
    words = comment.split()

    questionable_patterns = [
        r'^same\s+(as|like|with)',
        r'^(as|like)\s+above',
        r'^(as|like)\s+before',
        r'^(as|like)\s+previous',
        r'similar\s+to',
        r'^why',
        r'^what',
        r'^when',
        r'^where',
        r'^how',
        r'^these',
        r'^this',
        r'^those',
        r'^that',
        r'^shouldn\'t\s+',
        r'^shouldnt\s+',
        r'^should\s+not',
        r'^couldn\'t\s+',
        r'^couldnt\s+',
        r'^could\s+not',
        r'^wouldn\'t\s+',
        r'^wouldnt\s+',
        r'^would\s+not',
    ]

    has_questionable = any(re.search(pattern, comment) for pattern in questionable_patterns) and len(words) < 4

    keyword = 0
    for _, patterns in dictionary.items():
        keyword += sum(bool(re.search(pattern, comment)) for pattern in patterns)
    contextless = len(words) < 3 and keyword < 1

    return has_questionable or contextless
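To make the two checks concrete, here is a minimal sketch that reports which check flags a comment. It uses only a small subset of the patterns, and the names `OPEN_ENDED`, `KEYWORDS`, and `vague_reason` as well as the sample comments are hypothetical, introduced for this example only.

```python
import re

# Subset of the open-ended patterns listed above.
OPEN_ENDED = [r'^same\s+(as|like|with)', r'^why', r'^this']
# Hypothetical reduced keyword list standing in for the full dictionary.
KEYWORDS = [r'\b(remove|delete)\b', r'\b(change|rename)\b', r'\b(add|insert)\b']

def vague_reason(comment):
    """Return which check flags the comment, or None if it passes both."""
    text = str(comment).lower()
    n_words = len(text.split())
    if n_words < 4 and any(re.search(p, text) for p in OPEN_ENDED):
        return 'open-ended'
    if n_words < 3 and not any(re.search(p, text) for p in KEYWORDS):
        return 'contextless'
    return None

samples = [
    "Same as above",
    "Why?",
    "Looks odd",
    "Remove this",
    "Why is this variable declared twice?",
]
for sample in samples:
    print(repr(sample), vague_reason(sample))
```

Note that a longer question such as the last sample is kept, because both checks only apply to very short comments.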


#### Step 3: Correcting Abbreviations and Typos 


Because the dataset may contain abbreviations or typos (due to human error), we correct them before model training so that data integrity is preserved and they do not interfere with the model's learning process.

In [6]:
# Load abbreviations
def load_abbreviations():
    """Load and combine code and natural language abbreviations."""
    # Load code abbreviations
    with open("../Model and Dataset/abbreviations-in-code/data/abbrs/.json", 'r') as file:
        abbreviations = json.load(file)
        dict_for_abbreviations_code = {i['abbrs'][0]['abbr']: i['word'] for i in abbreviations}

    # Add natural language abbreviations
    dict_for_abbreviations_nl = {
        'fwiw': "for what it's worth",
        'u': "you",
        "ur": "your",
        "nit": "Nitpicking"
    }

    return {**dict_for_abbreviations_code, **dict_for_abbreviations_nl}


# Clean text function
def clean_text(text, abbreviations_dict):
    """Expand known abbreviations, then correct typos with TextBlob."""
    # Replace abbreviations
    words = text.split()
    words = [abbreviations_dict.get(word.lower(), word) for word in words]
    text = ' '.join(words)
    # Correct typos
    text = str(TextBlob(text).correct())
    return text
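`clean_text` looks up each whitespace-separated token as-is, so a token with attached punctuation such as `nit:` is not found in the dictionary. The sketch below shows this behaviour next to one possible workaround. The helper `expand_abbreviations` and its punctuation handling are assumptions for illustration rather than part of the pipeline above, and the TextBlob step is left out.

```python
import string

# Natural-language abbreviations taken from load_abbreviations() above.
abbreviations = {
    'fwiw': "for what it's worth",
    'u': 'you',
    'ur': 'your',
    'nit': 'Nitpicking',
}

def expand_plain(text, abbreviations_dict):
    """Token lookup as in clean_text: exact match on the lowercased token."""
    return ' '.join(abbreviations_dict.get(w.lower(), w) for w in text.split())

def expand_abbreviations(text, abbreviations_dict):
    """Hypothetical variant: ignore leading/trailing punctuation during lookup."""
    out = []
    for word in text.split():
        core = word.strip(string.punctuation)
        if core and core.lower() in abbreviations_dict:
            # Swap only the core token so surrounding punctuation is kept.
            word = word.replace(core, abbreviations_dict[core.lower()], 1)
        out.append(word)
    return ' '.join(out)

comment = "nit: can u rename this"
print(expand_plain(comment, abbreviations))
print(expand_abbreviations(comment, abbreviations))
```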


#### Step 4: Integrate All Filtering Functions 

In [7]:
# Dataset cleaning function
def clean_dataset(df, is_train, save_details=True, verbose=True):
    """Enhanced dataset cleaning with multiple filtering steps."""
    original_size = len(df)

    # Load abbreviations
    print("Loading abbreviations...")
    abbreviations_dict = load_abbreviations()

    # Initialize the similarity mask (stays all-False for test data)
    similarity_mask = pd.Series(False, index=df.index)

    if is_train:
        print("Finding similar comments with different labels...")
        ambiguous_comments, ambiguous_pairs = similar_comments_w_diff_label(df)
        similarity_mask = df['Review Comment'].str.lower().str.strip().isin(ambiguous_comments)

        # Remove ambiguous comments
        df_cleaned = df[~similarity_mask].copy()
    else:
        df_cleaned = df.copy()

    print("Applying additional filters...")
    contradictory_mask_cleaned = df_cleaned['Review Comment'].apply(has_all_keywords)
    ambiguous_mask_cleaned = df_cleaned['Review Comment'].apply(is_vague_or_contextless)

    combined_mask_cleaned = contradictory_mask_cleaned | ambiguous_mask_cleaned
    cleaned_df = df_cleaned[~combined_mask_cleaned].copy()

    # Apply text cleaning and typo correction to the final cleaned dataset
    print("Applying text cleaning and typo correction...")
    cleaned_df['Review Comment'] = cleaned_df['Review Comment'].apply(
        lambda x: clean_text(x, abbreviations_dict)
    )

    # Save details if needed
    if save_details and is_train:
        print("Saving details...")
        # Prepare 'removed_df' with comments removed due to similar comments with different labels 
        removed_similar_df = df[similarity_mask].copy()
        removed_similar_df['Removal_Reason'] = 'Similar'

        removed_contradictory_ambiguous_df = df_cleaned[combined_mask_cleaned].copy()
        removed_contradictory_ambiguous_df['Removal_Reason'] = 'Contradictory/Ambiguous'

        removed_df = pd.concat([removed_similar_df, removed_contradictory_ambiguous_df], ignore_index=True)

        # Save ambiguous pairs
        pd.DataFrame(ambiguous_pairs).to_csv('ambiguous_pairs.csv', index=False)
        print("Ambiguous pairs saved to 'ambiguous_pairs.csv'.")

        # Save removed examples
        removed_df.to_csv('removed_examples.csv', index=False)
        print("Removed examples saved to 'removed_examples.csv'.")

    # Verbose output
    if verbose:
        total_removed = original_size - len(cleaned_df)
        print("\nCleaning Results:")
        print(f"Original dataset size: {original_size}")
        print(f"Cleaned dataset size: {len(cleaned_df)}")
        print(f"Total removed: {total_removed} ({(total_removed / original_size) * 100:.1f}%)")

        if is_train:
            print(f"- Similar: {similarity_mask.sum()} ({(similarity_mask.sum() / original_size) * 100:.1f}%)")
        print(f"- Contradictory/Ambiguous: {combined_mask_cleaned.sum()} ({(combined_mask_cleaned.sum() / original_size) * 100:.1f}%)")

        print("\nClass distribution before cleaning:")
        print(df['Expected Operation by Developer'].value_counts(normalize=True))
        print("\nClass distribution after cleaning:")
        print(cleaned_df['Expected Operation by Developer'].value_counts(normalize=True))

    return cleaned_df


#### Step 5: Run the main() function

In [8]:
# Main function for data cleaning
def main():
    # Load data
    print("Loading data...")
    train_df = pd.read_excel('../Model and Dataset/Train.xlsx')
    test_df = pd.read_excel('../Model and Dataset/Test.xlsx')

    print("\nCleaning training data...")
    clean_train_df = clean_dataset(train_df, is_train=True)
    clean_train_df.to_excel('clean_train.xlsx', index=False)
    print("Cleaned training data saved to 'clean_train.xlsx'.")

    print("\nCleaning test data...")
    clean_test_df = clean_dataset(test_df, is_train=False)
    clean_test_df.to_excel('clean_test.xlsx', index=False)
    print("Cleaned test data saved to 'clean_test.xlsx'.")

    print("\nCleaning process completed!")

if __name__ == "__main__":
    main()

Loading data...

Cleaning training data...
Loading abbreviations...
Finding similar comments with different labels...
Normalizing comments...
Finding exact duplicates...
Calculating character-level similarities...
Calculating word-level similarities...
Applying additional filters...
Applying text cleaning and typo correction...
Saving details...
Ambiguous pairs saved to 'ambiguous_pairs.csv'.
Removed examples saved to 'removed_examples.csv'.

Cleaning Results:
Original dataset size: 931
Cleaned dataset size: 875
Total removed: 56 (6.0%)
- Similar: 20 (2.1%)
- Contradictory/Ambiguous: 36 (3.9%)

Class distribution before cleaning:
Expected Operation by Developer
insert     0.345865
delete     0.331901
replace    0.322234
Name: proportion, dtype: float64

Class distribution after cleaning:
Expected Operation by Developer
insert     0.336000
delete     0.332571
replace    0.331429
Name: proportion, dtype: float64
Cleaned training data saved to 'clean_train.xlsx'.

Cleaning test data...
Lo