# Notebook for the experiment of building **DeCaf** (**De**sign **C**l**a**ssi**f**ier)

## Architectural Overview/Design
![alt text](https://raw.githubusercontent.com/alvi2496/DeCaf/master/assets/architectural_overview.png)

## Objective
The main objective is to classify discussions from pull requests, issue trackers, commit messages and code comments as `design` or `general`. We also want the classifier to work across projects.

## Data Collection
- We intend to use three types of data. The first is used to train the `Word Embedding` model. Because `Word Embedding` requires well-structured text, we use sentences from literature (e.g. papers and books). We restrict our choice of papers and books to the Software Engineering domain to keep the context of our `Word Embedding` model within Software Engineering. The `Word Embedding` model is then used to vectorize our train data.
- We collected our train data from Stack Overflow questions, answers and comments, and classified each item as `design` or `general` based on its tags. For example, questions and answers tagged `design-patterns` or `software-design` fall under the class `design`, while data tagged `javascript` or `django` is classified as `general`.
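
The tag-based labeling described above can be sketched as follows. This is a minimal illustration, not the scraper used for the experiment: the names `DESIGN_TAGS` and `label_from_tags` are hypothetical, and the tag set holds only the two design tags named above.

```python
# Hypothetical helper: the tag set contains only the design tags named in the text.
DESIGN_TAGS = {"design-patterns", "software-design"}

def label_from_tags(tags):
    """Return 'design' if any tag is a design tag, otherwise 'general'."""
    return "design" if DESIGN_TAGS.intersection(tags) else "general"

print(label_from_tags(["java", "design-patterns"]))  # design
print(label_from_tags(["javascript", "django"]))     # general
```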

## Data Cleaning
Raw data can have a lot of noise. Especially when scraped from documents or websites, it can contain a lot of irrelevant content in the form of names, punctuation, numbers (e.g. years), and misspelled or incomplete words. It can also contain many stopwords that can confuse the model. We removed all of this to make our data as clean as possible. After the cleaning process, our data only contains words that are not stopwords, are present in the English dictionary and have a length greater than three.
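
The cleaning rules above could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual cleaning script: `clean_text`, `STOPWORDS` and `DICTIONARY` are hypothetical names, and the two word sets are tiny stand-ins for a full stopword list and a full English dictionary.

```python
import re

# Tiny stand-ins for illustration only; a real run would use a full
# stopword list and a full English dictionary.
STOPWORDS = {"this", "that", "with", "from", "have"}
DICTIONARY = {"pattern", "factory", "object", "creates", "class", "this"}

def clean_text(text, stopwords=STOPWORDS, dictionary=DICTIONARY):
    """Keep lowercase alphabetic tokens that are dictionary words,
    not stopwords, and longer than three characters."""
    # Punctuation and digits never match [a-z]+, so they are dropped here.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens
            if len(t) > 3 and t not in stopwords and t in dictionary]
    return " ".join(kept)

print(clean_text("This Factory pattern (2004) creates objcts!"))
# factory pattern creates
```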

In [0]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


## Assign Data location
- literature: holds the data for word embedding

In [0]:
literature_file = "/content/drive/My Drive/documents/projects/DeCaf/data/literature.txt"
enhanced_literature_file = "/content/drive/My Drive/documents/projects/DeCaf/data/enhanced_literature.txt"
so_model_file = "/content/drive/My Drive/documents/projects/DeCaf/models/pre_trained/so.bin"
we_model_file = "/content/drive/My Drive/documents/projects/DeCaf/models/we.bin"
enhanced_we_model_file = "/content/drive/My Drive/documents/projects/DeCaf/models/enhanced_we.bin"

## Enhance the Literature
- We enhance the `literature` using the pre-trained model `so.bin`
  - We take every word of the literature
  - We inject additional similar words related to software engineering after it.

### Read the literature.txt file

In [0]:
literature = open(literature_file, 'r')
words = literature.read().split(" ")
literature.close()
print("Total words in literature: ", len(words)) 

Total words in literature:  1575439


### Divide the words in chunks
- We divide the words into chunks so that each chunk can be processed with multiprocessing
- We take 300000 words in each chunk (split into four sub-chunks of 75000 words, one per process), leading to 6 chunks

In [0]:
file_chunks = []
file_chunks_names = []
index = 0
while index < len(words) - 1:
  chunk = []
  lower_limit = index
  i = 0
  while i < 4 and lower_limit < len(words) - 1:
    upper_limit = lower_limit + 75000
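    if upper_limit > len(words):
      # Note: slice ends are exclusive, so clamping to len(words) - 1
      # leaves the final word out of the last sub-chunk.
      upper_limit = len(words) - 1
    chunk.append(words[lower_limit:upper_limit])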
    print("lower_limit: " + str(lower_limit) + " upper_limit: " + str(upper_limit))
    lower_limit = upper_limit
    i = i + 1
  file_chunks_names.append(str(index) + '_' + str(lower_limit))
  index = lower_limit
  file_chunks.append(chunk)
print(file_chunks_names)

lower_limit: 0 upper_limit: 75000
lower_limit: 75000 upper_limit: 150000
lower_limit: 150000 upper_limit: 225000
lower_limit: 225000 upper_limit: 300000
lower_limit: 300000 upper_limit: 375000
lower_limit: 375000 upper_limit: 450000
lower_limit: 450000 upper_limit: 525000
lower_limit: 525000 upper_limit: 600000
lower_limit: 600000 upper_limit: 675000
lower_limit: 675000 upper_limit: 750000
lower_limit: 750000 upper_limit: 825000
lower_limit: 825000 upper_limit: 900000
lower_limit: 900000 upper_limit: 975000
lower_limit: 975000 upper_limit: 1050000
lower_limit: 1050000 upper_limit: 1125000
lower_limit: 1125000 upper_limit: 1200000
lower_limit: 1200000 upper_limit: 1275000
lower_limit: 1275000 upper_limit: 1350000
lower_limit: 1350000 upper_limit: 1425000
lower_limit: 1425000 upper_limit: 1500000
lower_limit: 1500000 upper_limit: 1575000
lower_limit: 1575000 upper_limit: 1575438
['0_300000', '300000_600000', '600000_900000', '900000_1200000', '1200000_1500000', '1500000_1575438']
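
The same two-level chunking can also be written with range-based slicing, which keeps the final word because slice ends are exclusive. This is a sketch under assumptions: `split_into_chunks` is a hypothetical helper, and it assumes `chunk_size` is divisible by `parts`.

```python
def split_into_chunks(words, chunk_size=300000, parts=4):
    """Split words into chunks of chunk_size, each divided into
    `parts` equal sub-chunks (the last ones may be shorter).
    Assumes chunk_size is divisible by parts."""
    sub_size = chunk_size // parts
    chunks, names = [], []
    for start in range(0, len(words), chunk_size):
        end = min(start + chunk_size, len(words))
        # Slice ends are exclusive, so `end` may equal len(words).
        subs = [words[i:min(i + sub_size, end)]
                for i in range(start, end, sub_size)]
        chunks.append(subs)
        names.append(f"{start}_{end}")
    return chunks, names

chunks, names = split_into_chunks(list(range(10)), chunk_size=4, parts=2)
print(names)   # ['0_4', '4_8', '8_10']
print(chunks)  # [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9]]]
```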


### Import the so word embedding

In [0]:
from gensim.models.keyedvectors import KeyedVectors
import warnings
import os
from datetime import datetime as dt


warnings.simplefilter("ignore")

so_model = KeyedVectors.load_word2vec_format(so_model_file, binary=True)

### Inject similar words

In [0]:
def inject_similar_words(process_number, words, injected_word_chunks):
  start_time = dt.now()
  for i in range(len(words)):
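    try:
      words_to_inject = [words[i]]
      similar_words = so_model.most_similar(words[i])
      # Keep only neighbours with a similarity score above 0.5
      for similar_tuple in similar_words:
        if similar_tuple[1] > 0.5:
          words_to_inject.append(similar_tuple[0])
      words[i] = " ".join(words_to_inject)
    except KeyError:
      # The word is not in the so_model vocabulary; leave it unchanged
      continue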
    if i % 10000 == 0:
      t = dt.now() - start_time
      start_time = dt.now()
      print("CPU " + str(process_number) + ": Processed: " + str(i) + " / " + str(len(words)) + " words in time: ", t)
  injected_word_chunks[process_number] = words
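
To make the effect of the 0.5 similarity threshold concrete, here is a small self-contained sketch that replaces the pre-trained model with a toy lookup table. `TOY_NEIGHBORS` and `inject` are hypothetical names, and the similarity scores are made up for illustration.

```python
# Toy stand-in for a word-embedding model: word -> [(neighbor, similarity), ...].
TOY_NEIGHBORS = {
    "refactor": [("restructure", 0.71), ("rewrite", 0.55), ("delete", 0.32)],
    "class": [("subclass", 0.64), ("banana", 0.12)],
}

def inject(words, neighbors=TOY_NEIGHBORS, threshold=0.5):
    """Follow each word with its neighbors scoring above threshold;
    words unknown to the model are kept unchanged."""
    enriched = []
    for word in words:
        enriched.append(word)
        for neighbor, score in neighbors.get(word, []):
            if score > threshold:
                enriched.append(neighbor)
    return enriched

print(" ".join(inject(["refactor", "the", "class"])))
# refactor restructure rewrite the class subclass
```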

In [0]:
import multiprocessing as mp

print("Available CPUs: ", mp.cpu_count())
index = 0
for word_chunks in file_chunks:
  literature_chunk_name = "/content/drive/My Drive/documents/projects/DeCaf/data/literature_chunks/" + file_chunks_names[index] + ".txt"
  if not os.path.exists(literature_chunk_name):
    print("Working on chunk: ", file_chunks_names[index])
    manager = mp.Manager()
    injected_word_chunks = manager.dict()
    jobs = []

    for i in range(len(word_chunks)):
      p = mp.Process(target=inject_similar_words, args=(i, word_chunks[i], injected_word_chunks))
      jobs.append(p)
      p.start()

    for proc in jobs:
      proc.join()

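    # Reassemble the results in sub-chunk order; values() follows insertion
    # order, i.e. the order in which the processes finished
    injected_word_chunks = [injected_word_chunks[i] for i in range(len(word_chunks))]

    corpus = []
    for chunk in injected_word_chunks:
      corpus.append(" ".join(chunk))
    corpus = " ".join(corpus)
    with open(literature_chunk_name, 'w') as literature:
      literature.write(corpus)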
  index = index + 1


Available CPUs:  4
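
When several processes store their results in a shared dictionary, the values come back in the order the processes finished, which need not match the original text order. The sketch below illustrates reassembling by worker number. A plain `dict` stands in for the `mp.Manager().dict()` proxy here, on the assumption that the proxy exposes the same insertion ordering; `assemble_in_order` is a hypothetical helper.

```python
def assemble_in_order(results_by_worker):
    """Join per-worker word lists by ascending worker number,
    regardless of the order in which the workers finished."""
    ordered = [results_by_worker[key] for key in sorted(results_by_worker)]
    return " ".join(" ".join(words) for words in ordered)

# Simulate workers 2, 0 and 1 finishing in that order.
results = {}
results[2] = ["e", "f"]
results[0] = ["a", "b"]
results[1] = ["c", "d"]

print(" ".join(" ".join(w) for w in results.values()))  # e f a b c d
print(assemble_in_order(results))                       # a b c d e f
```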


### Concatenate and save the chunks

In [0]:
if not os.path.exists(enhanced_literature_file):
  chunks_directory = "/content/drive/My Drive/documents/projects/DeCaf/data/literature_chunks/"
  print(file_chunks_names)
  with open(enhanced_literature_file, 'w') as outfile:
    for name in file_chunks_names:
      # Chunk files were saved above with a ".txt" extension
      with open(chunks_directory + name + ".txt") as infile:
        outfile.write(infile.read())
      outfile.write(" ")
  print('File created!')

## Create Word Embedding
- Use fastText for word embedding
  - We take enhanced_literature.txt as our input data.
  - We train the model unsupervised, since we just want to group the words according to their similarity.
  - We use skipgram, as we have observed that skipgram models work better with subword information than cbow.
  - We take character n-grams of 4 to 25 characters (`minn=4`, `maxn=25`). Since we remove every word of three characters or fewer, n-grams shorter than 4 characters are not needed. Also, design words seem to be on the longer side; for example, `reproducibility` contains 15 characters.
  - We use 200-dimensional word vectors (`dim=200`) and train for 10 epochs, both because the training corpus is relatively small.
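
To illustrate what the `minn` and `maxn` settings control, the sketch below enumerates the character n-grams of a word. fastText wraps each word in `<` and `>` boundary markers before extracting n-grams; the helper `char_ngrams` is hypothetical and ignores fastText internals such as hashing n-grams into buckets.

```python
def char_ngrams(word, minn=4, maxn=25):
    """List the character n-grams of '<word>' with minn <= n <= maxn,
    ordered by length and then by position."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        # range() is empty when n exceeds the wrapped word length.
        for start in range(0, len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("code", minn=4, maxn=5))
# ['<cod', 'code', 'ode>', '<code', 'code>']
```

A word that is shorter than `minn` even with its markers produces no n-grams at all, so only its whole-word vector would contribute.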

In [0]:
!pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/10/61/2e01f1397ec533756c1d893c22d9d5ed3fce3a6e4af1976e0d86bb13ea97/fasttext-0.9.1.tar.gz (57kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2385164 sha256=de8d7a1647e721bed2099191b91292c2a027a7032bc0354dd8d8a4ae1209cb6c
  Stored in directory: /root/.cache/pip/wheels/9f/f0/04/caa82c912aee89ce76358ff954f3f0729b7577c8ff23a292e3
Successfully built fasttext
Installing c

## Create Logging Methods

In [0]:
import pandas as pd

def log(log_file_path, message):
  file = pd.DataFrame([[str(dt.now()), 'info', message]], columns=['timestamp', 'type', 'message'])
  if not os.path.exists(log_file_path):
    file.to_csv(log_file_path)
  else:
    file.to_csv(log_file_path, mode='a', header=False)

def log_result(log_file_path, message):
  # The with-block closes the file even if a write fails
  with open(log_file_path, "a") as f:
    f.write(str(dt.now()))
    f.write(message)

## Train the `Word Embedding` model and save it as we.bin
- Train and save the model if the model is not present
- Load the model from disk if present

In [0]:
import fasttext as ft

log_file_path = "/content/drive/My Drive/documents/projects/DeCaf/logs/we.csv"
we_log = []

if not os.path.exists(enhanced_we_model_file):
  print(str(dt.now())+' Training Word Embedding model...')
  we_log.append(str(dt.now())+' Training Word Embedding model...')
  we_model = ft.train_unsupervised(enhanced_literature_file, "skipgram", minn=4, \
                                   maxn=25, dim=200, epoch=10)
  we_model.save_model(enhanced_we_model_file)
  print(str(dt.now())+' Model trained and saved successfully')
  we_log.append(str(dt.now())+' Model trained and saved successfully')
else:
  print(str(dt.now())+' Loading Word Embedding model from disk...')
  we_log.append(str(dt.now())+' Loading Word Embedding model from disk...')
  we_model = ft.load_model(enhanced_we_model_file)
  print(str(dt.now())+' Model loaded successfully.')
  we_log.append(str(dt.now())+' Model loaded successfully.')

log(log_file_path, \
    "\n".join(we_log))

2020-04-04 21:58:01.447887 Training Word Embedding model...


## Playing around with the Word Embedding model

Get the dimension of the model

In [0]:
we_model.get_dimension()

Get the first ten words of the model as a demo

In [0]:
we_model.get_words()[0:10]

Get the vector of a word

In [0]:
we_model.get_word_vector('Maintainability')[0:20]

Get the vector of a sentence

In [0]:
vector = we_model.get_sentence_vector("Add performance tracker to active admin jobs")
print(vector.shape)
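
For an unsupervised fastText model, a sentence vector is, to our understanding, the average of the L2-normalized vectors of the words in the sentence. The sketch below reproduces that idea with toy two-dimensional vectors. It is a simplification: `sentence_vector` and the `toy` vectors are hypothetical, and unknown words are skipped here, whereas fastText can build vectors for unseen words from their subword n-grams.

```python
import math

def sentence_vector(sentence, word_vectors):
    """Average the L2-normalized vectors of the whitespace-separated words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # simplification: skip words without a vector
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total

toy = {"add": [3.0, 4.0], "tracker": [0.0, 2.0]}
print(sentence_vector("add tracker", toy))  # approximately [0.3, 0.9]
```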

## Read and process **TRAIN** data
- We use three types of data for training the model. We scraped question, answer and comment data from Stack Overflow and tagged them `design` or `general`. The tagging was done automatically based on the original tags of the question. We then processed our data as we did for our word embedding data to remove noise. We also did some additional processing on our train data: we kept only documents that contain more than 10 words and discarded the rest. After processing, we got 100,000 documents (50,000 design, 50,000 general) for `questions.csv`, 40,000 documents (20,000 design, 20,000 general) for `answers.csv` and 60,000 documents (30,000 design, 30,000 general) for `comments.csv`.

- Our train data has been cleaned of noise and stopwords, and every document is longer than 10 words. The data is randomly shuffled.

- After processing the data and converting it into vectors with the help of our trained word embedding model, we save the data as .npy files for quick access.
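
The length filter described above can be sketched in a few lines. The helper `keep_long_documents` is hypothetical and counts words by splitting on whitespace, which may differ from the tokenization used in the actual preprocessing.

```python
def keep_long_documents(rows, min_words=10):
    """Keep (text, label) pairs whose text has more than min_words words."""
    return [(text, label) for text, label in rows
            if len(text.split()) > min_words]

rows = [
    ("factory pattern " * 6, "design"),    # 12 words, kept
    ("short general comment", "general"),  # 3 words, dropped
]
kept = keep_long_documents(rows)
print(len(kept), kept[0][1])  # 1 design
```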

#### Assign train data location


In [0]:
train_question_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/questions.csv"
train_answer_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/answers.csv"
train_comment_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/comments.csv"

#### Read the data with pandas

In [0]:
from prettytable import PrettyTable

train_question = pd.read_csv(train_question_file)
train_answer = pd.read_csv(train_answer_file)
train_comment = pd.read_csv(train_comment_file)

#### Explore the train data

In [0]:
print("Question train data")
print(train_question.head())
print("Answer train data")
print(train_answer.head())
print("Comment train data")
print(train_comment.head())

In [0]:
table = PrettyTable()

table.field_names = ["Dataset Name", "Shape", "# of design data", "# of general data"]

qd = train_question[train_question['label']=='design']
qg = train_question[train_question['label']=='general']

ad = train_answer[train_answer['label']=='design']
ag = train_answer[train_answer['label']=='general']

cd = train_comment[train_comment['label']=='design']
cg = train_comment[train_comment['label']=='general']

table.add_row(["Question", train_question.shape, qd.shape[0], qg.shape[0]])
table.add_row(["Answer", train_answer.shape, ad.shape[0], ag.shape[0]])
table.add_row(["Comment", train_comment.shape, cd.shape[0], cg.shape[0]])


print(table)

### Convert sentences into vectors using the Word Embedding model



Now we turn to converting our train data into vectors with the dimension of our word embedding model (`dim=200` in the training call). We name our variables by the following convention:
- question data = X_Q, question label = Y_Q
- answer data = X_A, answer label = Y_A
- comment data = X_C, comment label = Y_C

#### Location to save/retrieve data in vector format

In [0]:
X_Q_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/X_Q.npy"
Y_Q_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/Y_Q.npy"

X_A_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/X_A.npy"
Y_A_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/Y_A.npy"

X_C_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/X_C.npy"
Y_C_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/Y_C.npy"

We define the `get_text_vector(data, data_url)` function to convert text sentences into sentence vectors using our word embedding model, and `get_label_vector(labels, label_url)` to convert the labels into a vector (1 for `design`, 0 for `general`). Both functions:
- look for the .npy file on disk and, if found, load the data into memory.
- if not found:
  - convert the text or labels into vectors
  - save the result as a .npy file

In [0]:
import numpy as np

def get_text_vector(data, data_url):
  if os.path.exists(data_url):
    X = np.load(data_url)
  else:
    X = []
    for sentence in data:
      X.append(we_model.get_sentence_vector(sentence))
    X = np.array(X)
    np.save(data_url, X)
  return X

In [0]:
def get_label_vector(labels, label_url):
  if os.path.exists(label_url):
    Y = np.load(label_url)
  else:
    Y = []
    for label in labels:
      if label == 'design':
        Y.append(1)
      else:
        Y.append(0)
    Y = np.array(Y)
    np.save(label_url, Y)
  return Y
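
Both functions above follow the same load-or-compute caching pattern, which can be factored out as in the sketch below. The helper `load_or_compute` is hypothetical, and it stores JSON instead of `.npy` purely to keep the example free of third-party dependencies.

```python
import json
import os
import tempfile

def load_or_compute(path, compute):
    """Return the cached value at path, or compute, save and return it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    value = compute()
    with open(path, "w") as f:
        json.dump(value, f)
    return value

calls = []

def expensive_labels():
    calls.append(1)  # record that the slow path ran
    return [1 if label == "design" else 0
            for label in ["design", "general", "design"]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "Y.json")
    first = load_or_compute(cache, expensive_labels)   # computes and saves
    second = load_or_compute(cache, expensive_labels)  # loads from disk

print(first, second, len(calls))  # [1, 0, 1] [1, 0, 1] 1
```

One consequence of this pattern is that a cached file is reused even if the input data or the embedding model has changed, so stale cache files need to be deleted by hand.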

#### Load and inspect the data 

In [0]:
X_Q = get_text_vector(train_question['question'], X_Q_url)
Y_Q = get_label_vector(train_question['label'], Y_Q_url)
print('Shape of X_Q: ', X_Q.shape)
print('Shape of Y_Q: ', Y_Q.shape)

In [0]:
X_A = get_text_vector(train_answer['answer'], X_A_url)
Y_A = get_label_vector(train_answer['label'], Y_A_url)
print('Shape of X_A: ', X_A.shape)
print('Shape of Y_A: ', Y_A.shape)

In [0]:
X_C = get_text_vector(train_comment['comment'], X_C_url)
Y_C = get_label_vector(train_comment['label'], Y_C_url)
print('Shape of X_C: ', X_C.shape)
print('Shape of Y_C: ', Y_C.shape)

## Read and process **VALIDATION** data
- We are taking the same approach as above to read, process, convert and save our validation data.
- Validation data contains a mixture of `question`, `answer` and `comment` data
- X_V for the vector of the text, Y_V is the vector of the labels for validation data.

In [0]:
validation_data_file = "/content/drive/My Drive/documents/projects/DeCaf/data/validation_data/validation.csv"

In [0]:
validation_data = pd.read_csv(validation_data_file)

In [0]:
print("Shape: ", validation_data.shape)
print(validation_data.head())

In [0]:
print("Number of design data: ", validation_data[validation_data["label"] == "design"].shape[0])
print("number of general data: ", validation_data[validation_data["label"] == "general"].shape[0])

In [0]:
X_V = get_text_vector(validation_data['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/validation_data/X_V.npy")
Y_V = get_label_vector(validation_data['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/validation_data/Y_V.npy")

In [0]:
print("Shape of X_V: ", X_V.shape)
print("Shape of Y_V: ", Y_V.shape)

## Read and process **TEST** data
- We are taking the same approach as above to read, process, convert and save our test data.
- Test data contains a mixture of `question`, `answer` and `comment` data
- X_T is the vector of the text, Y_T is the vector of the labels for test data.

In [0]:
test_data_file = "/content/drive/My Drive/documents/projects/DeCaf/data/test_data/test.csv"

In [0]:
test_data = pd.read_csv(test_data_file)

In [0]:
print("Shape: ", test_data.shape)
print(test_data.head())

In [0]:
print("Number of design data: ", test_data[test_data["label"] == "design"].shape[0])
print("number of general data: ", test_data[test_data["label"] == "general"].shape[0])

In [0]:
X_T = get_text_vector(test_data['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/test_data/X_T.npy")
Y_T = get_label_vector(test_data['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/test_data/Y_T.npy")

In [0]:
print("Shape of X_T: ", X_T.shape)
print("Shape of Y_T: ", Y_T.shape)

## Experiment with Traditional Data Mining Algorithms
- Before experimenting with deep learning, we start by training some traditional data mining algorithms. We use the following classifiers, listed with their notations and parameter configurations:
  - k-Nearest Neighbors (`knn`)
  - Decision Tree (`dt`)
  - Random Forest (`rf`)
  - Logistic Regression (`lr`)
  - Linear SVM (`lsvm`)
    - Regularization parameter `C` = 0.025
  - RBF SVM (`rbf_svm`)
    - Kernel coefficient `gamma` = 2
    - Regularization parameter `C` = 1 (the default)
  - Neural Net (`nn`)
  - AdaBoost (`ab`)
  - Naive Bayes (`gnb`)
  - QDA (`qda`)


### Libraries

In [0]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

### Initialization

In [0]:

classifier_names = [
      "Nearest Neighbors",
      "Decision Tree",
      "Random Forest",
      "Logistic Regression",
      "Gaussian Naive Bayes", 
      "Neural Net", 
      "AdaBoost",
      "QDA",    
      "Linear SVM", 
      "RBF SVM",
]

model_paths = [
      "/content/drive/My Drive/documents/projects/DeCaf/models/knn.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/dt.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/rf.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/lr.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/gnb.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/nn.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/ab.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/qda.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/lsvm.joblib",
      "/content/drive/My Drive/documents/projects/DeCaf/models/rbf_svm.joblib"
]

classifiers = [
      KNeighborsClassifier(n_jobs=-1),
      DecisionTreeClassifier(), 
      RandomForestClassifier(n_jobs=-1), 
      LogisticRegression(max_iter=50000),
      GaussianNB(),
      MLPClassifier(max_iter=4000),
      AdaBoostClassifier(),
      QuadraticDiscriminantAnalysis(),
      SVC(kernel="linear", C=0.025), 
      SVC(gamma=2, C=1)
]

### Combine the three train data namely: `questions`, `answers` and `comment`

In [0]:
X = np.concatenate((X_Q, X_A, X_C), axis=0)
Y = np.concatenate((Y_Q, Y_A, Y_C), axis=0)

### Examine X and Y

In [0]:
print(X.shape)
print(Y.shape)

### **Train** and **Save** the models to disk

In [0]:
from joblib import dump, load

for index, classifier in enumerate(classifiers):
  if not os.path.exists(model_paths[index]):
    logs = []  # log lines for this model only
    start_time = dt.now()
    print(str(start_time) + " Started training model: ", classifier_names[index])
    # list.append takes a single argument, so build one string per entry
    logs.append(str(start_time) + " Started training model: " + classifier_names[index])
    model = classifier.fit(X, Y)
    end_time = dt.now()
    print(str(end_time) + " Finished training model: ", classifier_names[index])
    logs.append(str(end_time) + " Finished training model: " + classifier_names[index])
    print("Time to train model: ", end_time - start_time)
    logs.append("Time to train model: " + str(end_time - start_time))
    print("----------------------------------------------------------")
    dump(model, model_paths[index])
    log("/content/drive/My Drive/documents/projects/DeCaf/logs/model_train.csv", "\n".join(logs))

### Validate the models
- We use Area Under the Receiver Operating Characteristic Curve (**ROC AUC**) from prediction scores as the validation criteria.

In [0]:
from sklearn import metrics

def auc_score(X, Y, model):
  pred = model.predict(X)
  auc = metrics.roc_auc_score(Y, pred)
  return auc
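
ROC AUC is a ranking metric, so it can give different values for continuous scores than for thresholded 0/1 predictions such as those returned by `predict`. The sketch below shows the difference on made-up numbers with a small pairwise implementation; `roc_auc` is a hypothetical helper, not the scikit-learn function. Many scikit-learn classifiers expose `predict_proba` or `decision_function`, which could supply such scores.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.9]                  # continuous scores
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

print(roc_auc(labels, scores))  # 1.0
print(roc_auc(labels, hard))    # 0.75
```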

#### Load the models and calculate the **ROC AUC** score

In [0]:
result_path = "/content/drive/My Drive/documents/projects/DeCaf/results/test_data.txt"

if not os.path.exists(result_path):
  table = PrettyTable()

  table.field_names = classifier_names
  auc_scores = []
  for index, model_path in enumerate(model_paths):
    logs = []  # named `logs` so that the log() function is not shadowed
    start_loading_time = dt.now()
    print(str(start_loading_time) + " Loading model: ", classifier_names[index])
    logs.append(str(start_loading_time) + " Loading model: " + classifier_names[index])
    model = load(model_path)
    end_loading_time = dt.now()
    print(str(end_loading_time) + " Finished loading model: ", classifier_names[index])
    logs.append(str(end_loading_time) + " Finished loading model: " + classifier_names[index])
    start_time = dt.now()
    print(str(start_time) + " Start calculating AUC ROC using: ", classifier_names[index])
    logs.append(str(start_time) + " Start calculating AUC ROC using: " + classifier_names[index])
    auc_scores.append(auc_score(X_T, Y_T, model))
    end_time = dt.now()
    print(str(end_time) + " Finished calculating AUC ROC using: ", classifier_names[index])
    logs.append(str(end_time) + " Finished calculating AUC ROC using: " + classifier_names[index])
    print("Calculation time: ", end_time - start_time)
    # A timedelta must be converted to str before concatenation
    logs.append("Calculation time: " + str(end_time - start_time))
    print("--------------------------------------------------------------------------------")
    log("/content/drive/My Drive/documents/projects/DeCaf/logs/model_performance.csv", \
        "\n".join(logs))

  table.add_row(auc_scores)

  print(table)
  log_result(result_path, table.get_string())
else:
  print("Result persists at /content/drive/My Drive/documents/projects/DeCaf/results/test_data.txt")

## Cross dataset validation of the models
We are taking the following datasets to validate the models:
- Brunet 2014 (brunet2014.csv)
- Shakiba 2016 (shakiba2016.csv)
- Viviani 2018 (viviani2018.csv)
- Self Admitted Technical Debt/ SATD (satd.csv)
- Stack Overflow (so.csv)

In [0]:
cross_dataset_names = [
                       "Brunet 2014", "Shakiba 2016", "Viviani 2018",
                       "SATD"
]

brunet2014 = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/brunet2014.csv")
shakiba2016 = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/shakiba2016.csv")
viviani2018 = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/viviani2018.csv")
satd = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/satd.csv")
so = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/so.csv")

### Examine the data



In [0]:
table = PrettyTable()

table.field_names = ["Dataset Name", "Total # of rows", "# of design", "# of general"]
table.add_row(["Brunet 2014", brunet2014.shape[0], \
               brunet2014[brunet2014['label'] == 1].shape[0], \
               brunet2014[brunet2014['label'] == 0].shape[0]])
table.add_row(["Shakiba 2016", shakiba2016.shape[0], \
               shakiba2016[shakiba2016['label'] == 1].shape[0], \
               shakiba2016[shakiba2016['label'] == 0 ].shape[0]])
table.add_row(["Viviani 2018", viviani2018.shape[0], \
               viviani2018[viviani2018['label'] == 1].shape[0], \
               viviani2018[viviani2018['label'] == 0 ].shape[0]])
table.add_row(["SATD", satd.shape[0], \
               satd[satd['label'] == 1].shape[0], \
               satd[satd['label'] == 0 ].shape[0]])
table.add_row(["Stack overflow", so.shape[0], \
               so[so['label'] == 1].shape[0], \
               so[so['label'] == 0 ].shape[0]])
print(table)

### Vectorize and Save the data

In [0]:
X_brunet2014 = get_text_vector(brunet2014['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_brunet2014.npy")
Y_brunet2014 = brunet2014['label']
print(X_brunet2014.shape)
print(Y_brunet2014.shape)

In [0]:
X_shakiba2016 = get_text_vector(shakiba2016['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_shakiba2016.npy")
Y_shakiba2016 = shakiba2016['label']
print(X_shakiba2016.shape)
print(Y_shakiba2016.shape)

In [0]:
X_viviani2018 = get_text_vector(viviani2018['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_viviani2018.npy")
Y_viviani2018 = viviani2018['label']
print(X_viviani2018.shape)
print(Y_viviani2018.shape)

In [0]:
X_satd = get_text_vector(satd['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_satd.npy")
Y_satd = satd['label']
print(X_satd.shape)
print(Y_satd.shape)

In [0]:
X_so = get_text_vector(so['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_so.npy")
Y_so = so['label']
print(X_so.shape)
print(Y_so.shape)

### Validate with the Trained models

In [0]:
test_data = [X_brunet2014, X_shakiba2016, X_viviani2018, X_satd]
test_label = [Y_brunet2014, Y_shakiba2016, Y_viviani2018, Y_satd]

table = PrettyTable()
table.field_names = [" "] + cross_dataset_names

for index, model_path in enumerate(model_paths):
  logs = []
  print(str(dt.now()) + " Start loading model: ", classifier_names[index])
  logs.append(str(dt.now()) + " Start loading model: " + classifier_names[index])
  model = load(model_path)
  row = [classifier_names[index]]
  for i, data in enumerate(test_data):
    start_time = dt.now()
    print(str(start_time) + " Start evaluating ", cross_dataset_names[i])
    row.append(auc_score(data, test_label[i], model))
    end_time = dt.now()
    print(str(end_time) + " Finished evaluating ", cross_dataset_names[i])
    total_time = end_time - start_time
    logs.append("Evaluation time: " + str(total_time))
  log("/content/drive/My Drive/documents/projects/DeCaf/logs/cross_data_evaluation.csv", \
      "\n".join(logs))
  table.add_row(row)

print(table)
log_result("/content/drive/My Drive/documents/projects/DeCaf/results/cross_data.txt", table.get_string())