**Reading the data into a pandas DataFrame**

In [34]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [35]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/codeReview/3_preprocessing/fullDataset.csv')
df.head()

Unnamed: 0,description,Personal attacks,Threats or intimidation,Mockery,Lack of specificity,Discouragement without guide,Disregard for other time or boundaries,Unconscious bias,Dismissive attitude,Excessive control,Social,Toxic,AntiSocial,NonToxic
0,/It may happen that a service die/A service ma...,0,0,0,1,0,0,0,0,0,0,0,1,1
1,"@zhiyan, thanks for helping explanation. Overa...",0,0,0,0,0,0,0,0,0,1,0,0,1
2,all the code you have inline below should be r...,0,0,0,0,0,0,0,0,0,1,0,0,1
3,All you do in the interrupt handler is call wa...,1,0,1,1,0,1,0,0,0,0,0,1,1
4,Are you sure this leads to a color that makes ...,0,0,0,0,0,0,0,0,0,1,0,0,1


**1. Lowercasing:**

  Convert all text to lowercase to ensure uniformity and reduce the dimensionality of the data.

In [36]:
df['description'] = df['description'].str.lower()
df['description'].head()

0    /it may happen that a service die/a service ma...
1    @zhiyan, thanks for helping explanation. overa...
2    all the code you have inline below should be r...
3    all you do in the interrupt handler is call wa...
4    are you sure this leads to a color that makes ...
Name: description, dtype: object

**2. URL removal (URL-rem):**

 A code review comment may include a URL (e.g., a reference to documentation or a StackOverflow post). Although URLs are irrelevant to an antisociality classifier, they can increase the number of features for supervised classifiers. We used a regular expression matcher to identify and remove all URLs from our datasets.

In [37]:
import re

# Raw string so the regex escapes (e.g. \( and \)) reach `re` unchanged
url_regex = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')


def remove_url(text):
    return url_regex.sub(" ", text)

df['description'] = df['description'].apply(remove_url)
df['description'][106]

'> this approach should cover your example (because user will\n  > get a 404). and that should also cover "other cases" (e.g.\n  > network problem between glance and the backend store)\n  > which the user shouldn\'t know/care about.\n  \n  it doesn\'t 404 was just an example, it can be anything, another example.\n  \n  - me: upload image  \n  - glance: notfound("image url failed fetch returned: <any http status that you can think about>")\n  - me: ohh shit my url is not working because ...\n  - me: upload image   uri>\n  \n  like i said if we agree that reason doesn\'t contain any security issue then why hide it at all, if it\'s not meant for the user (special case faulty switch) so be it, if it\'s user problem that a big win.'
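
The URL-removal step can be tried on a single string without the DataFrame. The sketch below is illustrative only: `SIMPLE_URL_RE` and `strip_urls` are hypothetical names, and the pattern is a simplified stand-in (scheme followed by any non-whitespace run), not the notebook's exact regex.

```python
import re

# Simplified URL pattern (an assumption, not the notebook's regex):
# "http://" or "https://" followed by any run of non-whitespace characters.
SIMPLE_URL_RE = re.compile(r"https?://\S+")


def strip_urls(text: str) -> str:
    """Replace every URL with a space, then collapse repeated whitespace."""
    without_urls = SIMPLE_URL_RE.sub(" ", text)
    return " ".join(without_urls.split())


sample = "see https://stackoverflow.com/q/123 for details"
print(strip_urls(sample))  # see for details
```

Because `\S+` is greedy, this simplified pattern also swallows punctuation glued to the end of a URL, such as a closing parenthesis; that trade-off is acceptable here only because symbols are removed in a later step anyway.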

**3. Contraction expansion (Cntr-exp):**

  Contractions, which are shortened forms of one or two words, are common in code review texts. For example: doesn’t → does not, we’re → we are. By creating two different forms of the same term, contractions increase the number of unique tokens and add redundant features. We replaced 154 commonly used contractions, each with its expanded version.

In [38]:
contraction_mapping = {"ain't": "is not", "aren't": "are not",
                       "can't": "cannot", "'cause": "because",
                       "could've": "could have", "couldn't": "could not",
                       "didn't": "did not", "doesn't": "does not",
                       "don't": "do not", "hadn't": "had not", "hasn't": "has not",
                       "haven't": "have not", "he'd": "he would", "he'll": "he will",
                       "he's": "he is", "how'd": "how did", "how'd'y": "how do you",
                       "how'll": "how will", "how's": "how is", "I'd": "I would",
                       "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have",
                       "I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have",
                       "i'll": "i will", "i'll've": "i will have", "i'm": "i am",
                       "i've": "i have", "isn't": "is not", "it'd": "it would",
                       "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                       "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not",
                       "might've": "might have", "mightn't": "might not",
                       "mightn't've": "might not have", "must've": "must have",
                       "mustn't": "must not", "mustn't've": "must not have",
                       "needn't": "need not", "needn't've": "need not have",
                       "o'clock": "of the clock", "oughtn't": "ought not",
                       "oughtn't've": "ought not have", "shan't": "shall not",
                       "sha'n't": "shall not", "shan't've": "shall not have",
                       "she'd": "she would", "she'd've": "she would have",
                       "she'll": "she will", "she'll've": "she will have",
                       "she's": "she is", "should've": "should have", "shouldn't": "should not",
                       "shouldn't've": "should not have", "so've": "so have", "so's": "so as",
                       "this's": "this is", "that'd": "that would", "that'd've": "that would have",
                       "that's": "that is", "there'd": "there would",
                       "there'd've": "there would have", "there's": "there is",
                       "here's": "here is", "they'd": "they would", "they'd've": "they would have",
                       "they'll": "they will", "they'll've": "they will have", "they're": "they are",
                       "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would",
                       "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
                       "we're": "we are", "we've": "we have", "weren't": "were not",
                       "what'll": "what will",
                       "what'll've": "what will have", "what're": "what are", "what's": "what is",
                       "what've": "what have", "when's": "when is", "when've": "when have",
                       "where'd": "where did", "where's": "where is", "where've": "where have",
                       "who'll": "who will", "who'll've": "who will have", "who's": "who is",
                       "who've": "who have", "why's": "why is", "why've": "why have",
                       "will've": "will have", "won't": "will not", "won't've": "will not have",
                       "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
                       "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have",
                       "y'all're": "you all are", "y'all've": "you all have", "you'd": "you would",
                       "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                       "you're": "you are", "you've": "you have", "aint": "is not", "arent": "are not",
                       "cant": "cannot", "cause": "because",
                       "couldve": "could have", "couldnt": "could not",
                       "didnt": "did not", "doesnt": "does not",
                       "dont": "do not", "hadnt": "had not", "hasnt": "has not",
                       "havent": "have not", "howdy": "how do you",
                       "its": "it is", "lets": "let us", "maam": "madam", "maynt": "may not",
                       "mightve": "might have", "mightnt": "might not",
                       "mightntve": "might not have", "mustve": "must have",
                       "mustnt": "must not", "mustntve": "must not have",
                       "neednt": "need not", "needntve": "need not have",
                       "oclock": "of the clock", "oughtnt": "ought not",
                       "shouldve": "should have", "shouldnt": "should not",
                       "werent": "were not", "yall": "you all", "youre": "you are",
                       "youve": "you have"}

In [39]:
def expand_contraction(text):
    # Normalize typographic apostrophe variants to a plain ASCII apostrophe
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")

    # Expand every space-delimited token found in the contraction lexicon
    return ' '.join(contraction_mapping.get(t, t) for t in text.split(" "))



# Apply contraction expansion to the specified column
df['description'] = df['description'].apply(expand_contraction)
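
To see the expansion on a single comment, here is a minimal sketch with a three-entry lexicon. `SMALL_CONTRACTIONS`, `APOSTROPHE_VARIANTS`, and `expand_with` are hypothetical names used only for this illustration; the real lexicon is the full mapping defined earlier.

```python
# Tiny illustrative lexicon (a hypothetical subset of a full contraction map).
SMALL_CONTRACTIONS = {"doesn't": "does not", "we're": "we are", "can't": "cannot"}

# Apostrophe look-alikes that are treated as a plain ASCII apostrophe.
APOSTROPHE_VARIANTS = ["\u2019", "\u2018", "\u00b4", "`"]


def expand_with(text: str, mapping: dict) -> str:
    """Normalize apostrophes, then expand space-delimited contractions."""
    for variant in APOSTROPHE_VARIANTS:
        text = text.replace(variant, "'")
    return " ".join(mapping.get(token, token) for token in text.split(" "))


print(expand_with("it doesn\u2019t work and we\u2019re stuck", SMALL_CONTRACTIONS))
# it does not work and we are stuck
```

Matching is done on whole space-delimited tokens, so a contraction with punctuation attached (for example `doesn't,`) is left unexpanded by this approach.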

**4. Removing Stopwords:**

  Remove common words (stopwords) that contribute little to the meaning of the text, such as "and", "the", and "is". This reduces noise in the data.

In [40]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)

# Apply stopwords removal to the specified column
df['description'] = df['description'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
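
One point worth checking when applying stopword removal after contraction expansion: NLTK's English list contains negations such as "not" and "no", so "does not work" loses its negation. The sketch below shows one possible way to keep selected words. It uses a hypothetical miniature stopword set and plain whitespace splitting so that it runs without NLTK; `TINY_STOPWORDS`, `KEEP`, and `drop_stopwords` are illustrative names, not part of the pipeline above.

```python
# Hypothetical miniature stopword list; NLTK's English list is far longer.
TINY_STOPWORDS = {"the", "is", "and", "a", "this", "not", "no"}

# Negation words we choose to keep even though they are listed as stopwords.
KEEP = {"not", "no"}


def drop_stopwords(text, stop_words, keep=()):
    """Remove stopwords from whitespace-separated text, except those in `keep`."""
    effective = set(stop_words) - set(keep)
    return " ".join(w for w in text.split() if w.lower() not in effective)


print(drop_stopwords("this is not a good fix", TINY_STOPWORDS))        # good fix
print(drop_stopwords("this is not a good fix", TINY_STOPWORDS, KEEP))  # not good fix
```

Whether negations should be kept depends on the classifier; this is a design option, not a claim about what the original pipeline intended.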


**5. Symbol removal (Sym-rem):**

 Since special symbols (e.g., &, #, and ^) are irrelevant for antisociality classification tasks, we use a regular expression matcher to identify and remove special symbols.

 This code uses the re module to define a regular expression pattern (r'[^\w\s]') that matches any character that is not a word character or whitespace.

In [41]:
# Function to remove special symbols using regular expression
def remove_special_symbols(text):
    # Define a regular expression pattern to match special symbols
    pattern = r'[^\w\s]'
    # Use re.sub to replace matched symbols with an empty string
    text = re.sub(pattern, '', text)
    return text

# Apply symbol removal to the specified column
df['description'] = df['description'].apply(remove_special_symbols)

In [42]:
df['description'][106]

' approach cover example  user  get 404   also cover  cases   eg   network problem glance backend store   user knowcare  404 example  anything  another example    upload image  glance  notfound   image url failed fetch returned   http status think      ohh shit url working    upload image uri  like said agree reason contain security issue hide  meant user  special case faulty switch   user problem big win '
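
A small standalone check of the `[^\w\s]` pattern makes two of its properties visible: deleting a symbol can fuse neighbouring words, and underscores survive because `\w` includes `_`. The helper names below (`strip_symbols`, `space_symbols`) are hypothetical; the second function is an alternative sketch that substitutes a space instead of an empty string.

```python
import re

SYMBOL_RE = re.compile(r"[^\w\s]")


def strip_symbols(text: str) -> str:
    """Delete every character that is neither a word character nor whitespace."""
    return SYMBOL_RE.sub("", text)


def space_symbols(text: str) -> str:
    """Alternative: replace symbols with a space, then collapse whitespace."""
    return " ".join(SYMBOL_RE.sub(" ", text).split())


sample = "know/care & is_ready? (yes!)"
print(repr(strip_symbols(sample)))  # 'knowcare  is_ready yes'
print(repr(space_symbols(sample)))  # 'know care is_ready yes'
```

The first variant reproduces the "knowcare" fusion visible in the output above; the second keeps the two words apart at the cost of changing the step's behavior.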

**6. Repetition elimination (Rep-elm):**

 A person may repeat characters to misspell an antisocial word and evade dictionary-based antisociality detectors. For example, in the sentence “You’re duumbbbb!”, ‘dumb’ is misspelled through character repetitions. We have created a pattern-based matcher to identify such misspelled cases and replace each with its correctly spelled form.

In [43]:
# Function to eliminate repetitions using regular expression
def eliminate_repetitions(text):
    # Match any character followed by one or more copies of itself (a run of two or more).
    # Note: this also collapses legitimate double letters (e.g. 'been' -> 'ben').
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    # Use re.sub to replace repeated characters with a single occurrence
    text = pattern.sub(r'\1', text)
    return text

# Apply repetition elimination to the specified column
df['description'] = df['description'].apply(eliminate_repetitions)
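
The pattern `(.)\1{1,}` collapses every run of two or more identical characters, which fixes "duumbbbb" but also shortens correctly spelled words with double letters. The sketch below compares that behavior with a stricter variant that only collapses runs of three or more. Both function names are hypothetical, and the stricter threshold is an assumption offered for comparison, not the method used above.

```python
import re

ANY_REPEAT = re.compile(r"(.)\1+", re.DOTALL)      # runs of 2 or more
LONG_REPEAT = re.compile(r"(.)\1{2,}", re.DOTALL)  # runs of 3 or more


def collapse_all(text: str) -> str:
    """Collapse every run of repeated characters to a single character."""
    return ANY_REPEAT.sub(r"\1", text)


def collapse_long(text: str) -> str:
    """Collapse only runs of three or more identical characters."""
    return LONG_REPEAT.sub(r"\1", text)


print(collapse_all("you are duumbbbb"))          # you are dumb
print(collapse_all("this has been committed"))   # this has ben comited
print(collapse_long("this has been committed"))  # this has been committed
print(collapse_long("sooooo baaad"))             # so bad
print(collapse_long("you are duumbbbb"))         # you are duumb
```

Neither threshold is right for every word: the aggressive version damages normal vocabulary, while the stricter one leaves short repetitions such as "duumb" in place.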

**7. Adversarial pattern identification (Adv-ptrn):**

 A person may misspell profane words by replacing some characters with a symbol (e.g., ‘f*ck’ and ‘b!tch’) or use an acronym for a slang phrase (e.g., ‘stfu’).

 To identify such cases, we have developed a profanity preprocessor, which includes pattern matchers to identify various forms of 85 commonly used profane words. Our preprocessor replaces each identified case with its correctly spelled form.

In [44]:
RE_PATTERNS = {
    ' fuck ':
        [
            '(f)(u|[^a-z0-9 ])(c|[^a-z0-9 ])(k|[^a-z0-9 ])([^ ])*',
            '(f)([^a-z]*)(u)([^a-z]*)(c)([^a-z]*)(k)',
            r' f[!@#\$%\^\&\*]*u[!@#\$%\^&\*]*k', 'f u u c',
            '(f)(c|[^a-z ])(u|[^a-z ])(k)', r'f\*',
            'feck ', ' fux ', r'f\*\*',
            r'f\-ing', r'f\.u\.', 'f###', ' fu ', 'f@ck', 'f u c k', 'f uck', 'f ck'
        ],
    ' crap ':
        [
            ' (c)(r|[^a-z0-9 ])(a|[^a-z0-9 ])(p|[^a-z0-9 ])([^ ])*',
            ' (c)([^a-z]*)(r)([^a-z]*)(a)([^a-z]*)(p)',
            r' c[!@#\$%\^\&\*]*r[!@#\$%\^&\*]*p', 'cr@p', ' c r a p',
        ],

    ' ass ':
        [
            '[^a-z]ass ', '[^a-z]azz ', 'arrse', ' arse ', r'@\$\$',
            '[^a-z]anus', r' a\*s\*s', '[^a-z]ass[^a-z ]',
            r'a[@#\$%\^&\*][@#\$%\^&\*]', '[^a-z]anal ', 'a s s'
        ],

    ' ass hole ':
        [
            ' a[s|z]*wipe', 'a[s|z]*[w]*h[o|0]+[l]*e', r'@\$\$hole'
        ],

    ' bitch ':
        [
            'bitches', ' b[w]*i[t]*ch', ' b!tch',
            r' bi\+ch', r' b!\+ch', ' (b)([^a-z]*)(i)([^a-z]*)(t)([^a-z]*)(c)([^a-z]*)(h)',
            ' biatch', r' bi\*\*h', ' bytch', 'b i t c h'
        ],

    ' bastard ':
        [
            'ba[s|z]+t[e|a]+rd'
        ],

    ' transgender':
        [
            'transgender'
        ],

    ' gay ':
        [
            'gay', 'homo'
        ],

    ' cock ':
        [
            '[^a-z]cock', 'c0ck', '[^a-z]cok ', 'c0k', '[^a-z]cok[^aeiou]', ' cawk',
            '(c)([^a-z ])(o)([^a-z ]*)(c)([^a-z ]*)(k)', 'c o c k'
        ],

    ' dick ':
        [
            ' dick[^aeiou]', 'd i c k'
        ],

    ' suck ':
        [
            'sucker', '(s)([^a-z ]*)(u)([^a-z ]*)(c)([^a-z ]*)(k)', 'sucks', '5uck', 's u c k'
        ],

    ' cunt ':
        [
            'cunt', 'c u n t'
        ],

    ' bull shit ':
        [
            r'bullsh\*t', r'bull\$hit', 'bull sh.t'
        ],

    ' jerk ':
        [
            'jerk'
        ],

    ' idiot ':
        [
            'i[d]+io[t]+', '(i)([^a-z ]*)(d)([^a-z ]*)(i)([^a-z ]*)(o)([^a-z ]*)(t)', 'idiots', 'i d i o t'
        ],

    ' dumb ':
        [
            '(d)([^a-z ]*)(u)([^a-z ]*)(m)([^a-z ]*)(b)'
        ],

    ' shit ':
        [
            'shitty', '(s)([^a-z ]*)(h)([^a-z ]*)(i)([^a-z ]*)(t)', 'shite', r'\$hit', 's h i t', r'sh\*tty',
            r'sh\*ty', r'sh\*t'
        ],

    ' shit hole ':
        [
            'shythole', 'sh.thole'
        ],

    ' retard ':
        [
            'returd', 'retad', 'retard', 'wiktard', 'wikitud'
        ],

    ' rape ':
        [
            'raped'
        ],

    ' dumb ass':
        [
            'dumbass', 'dubass'
        ],

    ' ass head':
        [
            'butthead'
        ],

    ' sex ':
        [
            'sexy', 's3x', 'sexuality'
        ],

    ' nigger ':
        [
            'nigger', 'ni[g]+a', ' nigr ', 'negrito', 'niguh', 'n3gr', 'n i g g e r'
        ],

    ' shut the fuck up':
        [
            ' stfu', '^stfu'
        ],

    ' for your fucking information':
        [
            ' fyfi', '^fyfi'
        ],
    ' get the fuck off':
        [
            'gtfo', '^gtfo'
        ],

    ' oh my fucking god ':
        [
            ' omfg', '^omfg'
        ],

    ' what the hell ':
        [
            ' wth', '^wth'
        ],

    ' what the fuck ':
        [
            ' wtf', '^wtf'
        ],
    ' son of bitch ':
        [
            ' sob ', '^sob '
        ],

    ' pussy ':
        [
            'pussy[^c]', 'pusy', 'pussi[^l]', 'pusses', '(p)(u|[^a-z0-9 ])(s|[^a-z0-9 ])(s|[^a-z0-9 ])(y)',
        ],

    ' faggot ':
        [
            'faggot', ' fa[g]+[s]*[^a-z ]', 'fagot', 'f a g g o t', 'faggit',
            '(f)([^a-z ]*)(a)([^a-z ]*)([g]+)([^a-z ]*)(o)([^a-z ]*)(t)', 'fau[g]+ot', 'fae[g]+ot',
        ],

    ' whore ':
        [
            r'wh\*\*\*', 'w h o r e'
        ],

    ' haha ':
        [
            r'ha\*\*\*ha',
        ],
}

In [45]:
# Function to normalize obfuscated profanity based on patterns
def profanity_detector(text, patterns):
    for word, word_patterns in patterns.items():
        for pattern in word_patterns:
            # Use re.sub to replace each matched variant with the correct spelling
            text = re.sub(pattern, f' {word} ', text)
    return text

# Apply adversarial pattern normalization to the specified column
df['description'] = df['description'].apply(lambda x: profanity_detector(x, RE_PATTERNS))
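
The replacement loop can be exercised on a couple of strings with a reduced pattern table. The sketch below reuses the ‘dumb’ and first ‘idiot’ patterns from the table above, precompiles them once, and collapses the extra spaces that the padded replacement introduces. `SMALL_PATTERNS`, `COMPILED`, and `normalize_obfuscated` are hypothetical names for this illustration.

```python
import re

# Two pattern groups taken from the larger table; the rest are omitted here.
SMALL_PATTERNS = {
    "dumb": [r"(d)([^a-z ]*)(u)([^a-z ]*)(m)([^a-z ]*)(b)"],
    "idiot": [r"i[d]+io[t]+"],
}

# Compile once so the same regex objects are reused for every comment.
COMPILED = {
    word: [re.compile(p) for p in pats] for word, pats in SMALL_PATTERNS.items()
}


def normalize_obfuscated(text: str, compiled=COMPILED) -> str:
    """Replace each obfuscated variant with its canonical spelling."""
    for word, regexes in compiled.items():
        for regex in regexes:
            text = regex.sub(f" {word} ", text)
    # Padding the replacement with spaces can leave doubled spaces behind.
    return " ".join(text.split())


print(normalize_obfuscated("you are d.u.m.b"))   # you are dumb
print(normalize_obfuscated("what an iddiottt"))  # what an idiot
```

The first example only works while the dots are still in the text: patterns that rely on symbols or on repeated letters can match only if this kind of normalization runs before symbols are stripped and repetitions are collapsed.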

**8. Identifier splitting (Id-split):**

 In this step, we use a regular expression matcher to split identifiers written in both camelCase and under_score forms. For example, this step will replace ‘isCrap’ with ‘is Crap’ and replace ‘is_shitty’ with ‘is shitty’. This preprocessing may help to identify example code segments with profane words.

In [46]:
# Function to split camelCase and under_score identifiers
def split_identifiers(text):
    result = re.sub('[_]+', ' ', text)  # replace underscores with a space
    # Insert a space before each run of capitals, then before each capitalized word
    result = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', result))
    return result

# Apply identifier splitting to the specified column
df['description'] = df['description'].apply(split_identifiers)
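
A standalone sketch of identifier splitting is shown below. It uses zero-width lookaround boundaries instead of the nested substitutions above, which is an alternative formulation and an assumption on my part; `BOUNDARY_RE` and `split_identifier_words` are hypothetical names. Note that camelCase boundaries are only detectable while the original casing is intact, so text that has already been lowercased passes through unchanged.

```python
import re

# A boundary is either lower/digit -> upper ("isCrap") or the end of an
# acronym followed by a capitalized word ("HTTPResponse").
BOUNDARY_RE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier_words(text: str) -> str:
    """Split under_score and camelCase identifiers into separate words."""
    text = re.sub(r"_+", " ", text)    # underscores become spaces
    return BOUNDARY_RE.sub(" ", text)  # insert a space at each case boundary


print(split_identifier_words("isCrap"))             # is Crap
print(split_identifier_words("is_shitty"))          # is shitty
print(split_identifier_words("parseHTTPResponse"))  # parse HTTP Response
print(split_identifier_words("iscrap"))             # iscrap
```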

**9. Programming Keywords Removal (Kwrd-rem):**

 Code review texts often include programming-language-specific keywords (e.g., ‘while’, ‘case’, ‘if’, ‘catch’, and ‘except’). These keywords are software engineering (SE) domain-specific jargon and are not useful for toxicity prediction. We have created a list of 90 programming keywords used in popular programming languages (e.g., C++, Java, Python, C#, PHP, JavaScript, and Go). This step searches for and removes occurrences of those programming keywords from a text.

In [47]:
programming_keywords=[]
# Open the file in read mode
with open('/content/drive/MyDrive/codeReview/3_preprocessing/programming_keywords.txt', 'r') as file:
    # Read the file line by line
    for line in file:
        programming_keywords.append(line.strip())

programming_keywords

['while',
 'case',
 'switch',
 'def',
 'abstract',
 'byte',
 'continue',
 'native',
 'private',
 'synchronized',
 'if',
 'do',
 'include',
 'each',
 'than',
 'finally',
 'class',
 'double',
 'float',
 'int',
 'else',
 'instanceof',
 'long',
 'super',
 'import',
 'short',
 'default',
 'catch',
 'try',
 'new',
 'final',
 'extends',
 'implements',
 'public',
 'protected',
 'static',
 'this',
 'return',
 'char',
 'const',
 'break',
 'boolean',
 'bool',
 'package',
 'byte',
 'assert',
 'raise',
 'global',
 'with',
 'or',
 'yield',
 'in',
 'out',
 'except',
 'and',
 'enum',
 'signed',
 'void',
 'virtual',
 'union',
 'goto',
 'var',
 'function',
 'require',
 'print',
 'echo',
 'foreach',
 'elseif',
 'namespace',
 'delegate',
 'event',
 'override',
 'struct',
 'readonly',
 'explicit',
 'interface',
 'get',
 'set',
 'elif',
 'for',
 'throw',
 'throws',
 'lambda',
 'endfor',
 'endforeach',
 'endif',
 'endwhile',
 'clone',
 'ifdef',
 'mk']

In [48]:
# Set membership makes each lookup O(1) instead of scanning the list
keyword_set = set(programming_keywords)

def remove_keywords(text):
    words = text.split()
    resultwords = [word for word in words if word.lower() not in keyword_set]
    return ' '.join(resultwords)

# Apply remove keywords to the specified column
df['description'] = df['description'].apply(remove_keywords)
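
The keyword filter is easy to check in isolation. The sketch below uses a hypothetical four-word keyword set (`SMALL_KEYWORDS`) rather than the keyword file loaded above, and `drop_keywords` is an illustrative name.

```python
# Hypothetical miniature keyword set; the real list is read from a file.
SMALL_KEYWORDS = frozenset({"if", "while", "catch", "return"})


def drop_keywords(text: str, keywords=SMALL_KEYWORDS) -> str:
    """Remove whole tokens that are programming keywords (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in keywords)


print(drop_keywords("wrap it in a try block and return early if null"))
# wrap it in a try block and early null
print(drop_keywords("this returns null"))
# this returns null
```

Only exact tokens are removed, so inflected forms such as "returns" are kept. Several entries in the full list ("this", "in", "and", "or", "do") are also ordinary English words, so this step removes them from prose as well as from code.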

In [49]:
# Specify the path for the CSV file
csv_file_path = '/content/drive/MyDrive/codeReview/3_preprocessing/preprocessedData.csv'

# Write the DataFrame to a CSV file
df.to_csv(csv_file_path, index=False)