# Notebook for the experiment of building **DeCaf** (**De**sign **C**l**a**ssi**f**ier)

## Architectural Overview/Design
![alt text](https://raw.githubusercontent.com/alvi2496/DeCaf/master/assets/DataVectorizer.png)

## Objective
The main objective is to vectorize the `train`, `validate`, `test` and `cross` data. 

In [46]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [47]:
!pip install fasttext



In [0]:
so_model_file = "/content/drive/My Drive/documents/projects/DeCaf/models/pre_trained/so.bin"
we_model_file = "/content/drive/My Drive/documents/projects/DeCaf/models/we.bin"
enhanced_we_model_file = "/content/drive/My Drive/documents/projects/DeCaf/models/enhanced_we.bin"

In [49]:
from gensim.models.keyedvectors import KeyedVectors
import warnings
import os
from datetime import datetime as dt
import fasttext as ft


warnings.simplefilter("ignore")

so_model = KeyedVectors.load_word2vec_format(so_model_file, binary=True)
we_model = ft.load_model(we_model_file)




## Read and process **TRAIN** data
- We use three types of data for training the model. We scraped question, answer and comment data from Stack Overflow and tagged each document `design` or `general`. The tagging was done automatically, based on the original tags of the question. We then cleaned the data the same way we cleaned our word embedding data, to remove noise. We also applied one additional step to the train data: we kept only the documents that contain more than 10 words and discarded the rest. After processing, we had 100,000 documents (50,000 design, 50,000 general) in `questions.csv`, 40,000 documents (20,000 design, 20,000 general) in `answers.csv` and 60,000 documents (30,000 design, 30,000 general) in `comments.csv`.

- Our train data has been cleaned of noise and stopwords, and every document is longer than 10 words. The rows are in random order.

- After processing the data and converting it to vectors with our trained word embedding model, we save the vectors as `.npy` files for quick access.
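
The length filter described above can be sketched as follows. This is a minimal illustration, not the preprocessing script that produced the CSV files: the helper name `keep_long_documents`, the `min_words` parameter and the toy data are assumptions made for the example.

```python
import pandas as pd

def keep_long_documents(df, text_column, min_words=10):
    """Keep only rows whose text has more than `min_words` whitespace-separated words."""
    # Missing texts are treated as empty strings, so they count as 0 words.
    word_counts = df[text_column].fillna("").astype(str).str.split().str.len()
    return df[word_counts > min_words].reset_index(drop=True)

# Hypothetical toy data: one short document, one 11-word document, one missing text.
toy = pd.DataFrame({
    "text": ["too short", " ".join(["word"] * 11), None],
    "label": ["general", "design", "general"],
})
filtered = keep_long_documents(toy, "text")
print(filtered.shape)
```

Only the 11-word document survives, because the filter requires strictly more than 10 words.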

#### Assign train data location


In [0]:
train_question_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/questions.csv"
train_answer_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/answers.csv"
train_comment_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/comments.csv"

#### Read the data with pandas

In [0]:
from prettytable import PrettyTable
import pandas as pd

train_question = pd.read_csv(train_question_file)
train_answer = pd.read_csv(train_answer_file)
train_comment = pd.read_csv(train_comment_file)

#### Explore the train data

In [52]:
print("Question train data")
print(train_question.head())
print("Answer train data")
print(train_answer.head())
print("Comment train data")
print(train_comment.head())

Question train data
                                            question    label
0  control pattern according principle elsewhere ...   design
1  messenger throws error getting errors messenge...  general
2  error cannot find symbol variable trying integ...   design
3  data mapper pattern different repository patte...   design
4  start project using poetry start project using...  general
Answer train data
                                              answer    label
0  just tell solved issue found feed posted file ...  general
1  does actually implement copy cannot implement ...  general
2  following comments included code note will spe...  general
3  looks user present undefined method error need...  general
4  think better natural argue method names explic...   design
Comment train data
                                             comment    label
0  public declared class later library members sa...   design
1  using server definitely needs reasonable chanc...  general
2  defined cr

In [53]:
table = PrettyTable()

table.field_names = ["Dataset Name", "Shape", "# of design data", "# of general data"]

qd = train_question[train_question['label']=='design']
qg = train_question[train_question['label']=='general']

ad = train_answer[train_answer['label']=='design']
ag = train_answer[train_answer['label']=='general']

cd = train_comment[train_comment['label']=='design']
cg = train_comment[train_comment['label']=='general']

table.add_row(["Question", train_question.shape, qd.shape[0], qg.shape[0]])
table.add_row(["Answer", train_answer.shape, ad.shape[0], ag.shape[0]])
table.add_row(["Comment", train_comment.shape, cd.shape[0], cg.shape[0]])


print(table)

+--------------+-------------+------------------+-------------------+
| Dataset Name |    Shape    | # of design data | # of general data |
+--------------+-------------+------------------+-------------------+
|   Question   | (100000, 2) |      50000       |       50000       |
|    Answer    |  (40000, 2) |      20000       |       20000       |
|   Comment    |  (60000, 2) |      30000       |       30000       |
+--------------+-------------+------------------+-------------------+


### Convert sentences to vectors using the Word Embedding model



Now we turn to converting our train data into 300-dimensional vectors. We name our variables using the following convention:
- question data = X_Q, question label = Y_Q
- answer data = X_A, answer label = Y_A
- comment data = X_C, comment label = Y_C

#### Location to save/retrieve data in vector format

In [0]:
X_Q_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/X_Q.npy"
Y_Q_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/Y_Q.npy"

X_A_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/X_A.npy"
Y_A_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/Y_A.npy"

X_C_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/X_C.npy"
Y_C_url = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/Y_C.npy"

We define `get_text_vector(data, data_url)` to convert text sentences into 300-dimensional vectors, and `get_label_vector(labels, label_url)` to convert the labels into a vector. Both functions work the same way:
- Look for the `.npy` file on disk. If it is found, load the data into memory.
- If it is not found:
  - convert the text or labels into vectors
  - save the result as a `.npy` file

In [0]:
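import numpy as np

def get_text_vector(data, data_url):
  """Return one sentence vector per row of `data`, cached at `data_url`."""
  if os.path.exists(data_url):
    # Reuse the vectors saved by an earlier run.
    X = np.load(data_url)
  else:
    # fastText returns a single fixed-size vector for each sentence.
    X = np.array([we_model.get_sentence_vector(sentence) for sentence in data])
    np.save(data_url, X)
  return X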
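
To make the sentence-to-vector step more concrete, here is a rough sketch of what we understand `get_sentence_vector` to do for an unsupervised fastText model: each word vector is divided by its L2 norm, and the normalized vectors are averaged. The function `average_sentence_vector` and the two-dimensional toy vectors are hypothetical. The real model also builds vectors for unseen words from subword n-grams, which this sketch does not attempt; it simply skips unknown words.

```python
import numpy as np

def average_sentence_vector(sentence, word_vectors, dim):
    """Average the L2-normalized vectors of the known words in `sentence`."""
    total = np.zeros(dim)
    count = 0
    for word in sentence.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # simplification: fastText itself would use subword n-grams here
        norm = np.linalg.norm(vec)
        if norm > 0:
            total += vec / norm
            count += 1
    # An empty or fully unknown sentence yields the zero vector.
    return total / count if count else total

# Hypothetical 2-dimensional embeddings, just for illustration.
toy_vectors = {
    "design": np.array([3.0, 4.0]),   # normalizes to [0.6, 0.8]
    "pattern": np.array([0.0, 2.0]),  # normalizes to [0.0, 1.0]
}
sentence_vec = average_sentence_vector("design pattern unknown", toy_vectors, dim=2)
print(sentence_vec)
```

Whatever the sentence length, the result has the same dimension as the word vectors, which is why every row of `X` has 300 columns.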

In [0]:
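def get_label_vector(labels, label_url):
  """Return labels encoded as 1 (design) or 0 (general), cached at `label_url`."""
  if os.path.exists(label_url):
    Y = np.load(label_url)
  else:
    # Train, validation and test labels are strings; cross data labels are already 0/1.
    Y = np.array([1 if label == 'design' or label == 1 else 0 for label in labels])
    np.save(label_url, Y)
  return Y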
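
One thing to keep in mind with this load-or-compute pattern is that a stale `.npy` file is loaded silently if the CSV changes later. A possible guard, sketched here under our own assumptions, is to check the cached row count before trusting the file. The helper `load_or_compute` and its `expected_rows` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, compute, expected_rows):
    """Load the array at `path` if its row count matches, else recompute and save it."""
    if os.path.exists(path):
        cached = np.load(path)
        if cached.shape[0] == expected_rows:
            return cached
    # Cache is missing or has the wrong number of rows: rebuild it.
    result = np.asarray(compute())
    np.save(path, result)
    return result

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "Y_demo.npy")
    first = load_or_compute(cache_path, lambda: [1, 0, 1], expected_rows=3)
    # Second call hits the cache, so this failing callable is never invoked.
    second = load_or_compute(cache_path, lambda: 1 / 0, expected_rows=3)
    # A different row count invalidates the cache and triggers recomputation.
    third = load_or_compute(cache_path, lambda: [0, 0], expected_rows=2)
```

A row-count check is cheap but only catches changes in size; edits that keep the same number of rows would still go unnoticed.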

#### Load and inspect the data 

In [57]:
X_Q = get_text_vector(train_question['question'], X_Q_url)
Y_Q = get_label_vector(train_question['label'], Y_Q_url)
print('Shape of X_Q: ', X_Q.shape)
print('Shape of Y_Q: ', Y_Q.shape)

Shape of X_Q:  (100000, 300)
Shape of Y_Q:  (100000,)


In [58]:
X_A = get_text_vector(train_answer['answer'], X_A_url)
Y_A = get_label_vector(train_answer['label'], Y_A_url)
print('Shape of X_A: ', X_A.shape)
print('Shape of Y_A: ', Y_A.shape)

Shape of X_A:  (40000, 300)
Shape of Y_A:  (40000,)


In [59]:
X_C = get_text_vector(train_comment['comment'], X_C_url)
Y_C = get_label_vector(train_comment['label'], Y_C_url)
print('Shape of X_C: ', X_C.shape)
print('Shape of Y_C: ', Y_C.shape)

Shape of X_C:  (60000, 300)
Shape of Y_C:  (60000,)


## Read and process **VALIDATION** data
- We are taking the same approach as above to read, process, convert and save our validation data.
- Validation data contains a mixture of `question`, `answer` and `comment` data
- `X_V` is the vector of the text and `Y_V` is the vector of the labels for the validation data.

In [0]:
validation_data_file = "/content/drive/My Drive/documents/projects/DeCaf/data/validation_data/validation.csv"

In [0]:
validation_data = pd.read_csv(validation_data_file)

In [62]:
print("Shape: ", validation_data.shape)
print(validation_data.head())

Shape:  (30000, 2)
                                                text    label
0  better handling type issue sort development us...  general
1  custom route dependency injection route define...   design
2  crawler design calling async calling service l...   design
3  page seems freeze triggered event issue based ...  general
4  install package composer require trying compos...  general


In [63]:
print("Number of design data: ", validation_data[validation_data["label"] == "design"].shape[0])
print("number of general data: ", validation_data[validation_data["label"] == "general"].shape[0])

Number of design data:  15000
number of general data:  15000


In [0]:
X_V = get_text_vector(validation_data['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/validation_data/X_V.npy")
Y_V = get_label_vector(validation_data['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/validation_data/Y_V.npy")

In [65]:
print("Shape of X_V: ", X_V.shape)
print("Shape of Y_V: ", Y_V.shape)

Shape of X_V:  (30000, 300)
Shape of Y_V:  (30000,)


## Read and process **TEST** data
- We are taking the same approach as above to read, process, convert and save our test data.
- Test data contains a mixture of `question`, `answer` and `comment` data
- `X_T` is the vector of the text and `Y_T` is the vector of the labels for the test data.

In [0]:
test_data_file = "/content/drive/My Drive/documents/projects/DeCaf/data/test_data/test.csv"

In [0]:
test_data = pd.read_csv(test_data_file)

In [68]:
print("Shape: ", test_data.shape)
print(test_data.head())

Shape:  (30000, 2)
                                                text    label
0  form using submit insert fairly working stuck ...  general
1  using shared drive portal storage sharing clie...   design
2  scale parallel project looking using together ...   design
3  decouple service associated database separate ...   design
4  method does exist trying send mail user know s...  general


In [69]:
print("Number of design data: ", test_data[test_data["label"] == "design"].shape[0])
print("number of general data: ", test_data[test_data["label"] == "general"].shape[0])

Number of design data:  15000
number of general data:  15000


In [0]:
X_T = get_text_vector(test_data['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/test_data/X_T.npy")
Y_T = get_label_vector(test_data['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/test_data/Y_T.npy")

In [71]:
print("Shape of X_T: ", X_T.shape)
print("Shape of Y_T: ", Y_T.shape)

Shape of X_T:  (30000, 300)
Shape of Y_T:  (30000,)


## Read and process **Cross** data
- We are taking the same approach as above to read, process, convert and save our cross data.
- Cross data comes from several separate datasets, listed below.
- Each dataset gets its own pair of arrays: `X_<name>` for the vector of the text and `Y_<name>` for the vector of the labels (for example `X_brunet2014` and `Y_brunet2014`).


### Cross dataset 
We are taking the following datasets to validate the models:
- Brunet 2014 (brunet2014.csv)
- Shakiba 2016 (shakiba2016.csv)
- Viviani 2018 (viviani2018.csv)
- Self Admitted Technical Debt/ SATD (satd.csv)
- Stack Overflow (so.csv)

In [0]:
cross_dataset_names = [
                       "Brunet 2014", 
                       "Shakiba 2016", "Viviani 2018",
                       "SATD"
]

brunet2014 = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/brunet2014.csv")
shakiba2016 = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/shakiba2016.csv")
viviani2018 = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/viviani2018.csv")
satd = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/satd.csv")
so = pd.read_csv("/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/so.csv")

### Examine the data



In [73]:
table = PrettyTable()

table.field_names = ["Dataset Name", "Total # of rows", "# of design", "# of general"]
table.add_row(["Brunet 2014", brunet2014.shape[0], \
               brunet2014[brunet2014['label'] == 1].shape[0], \
               brunet2014[brunet2014['label'] == 0].shape[0]])
table.add_row(["Shakiba 2016", shakiba2016.shape[0], \
               shakiba2016[shakiba2016['label'] == 1].shape[0], \
               shakiba2016[shakiba2016['label'] == 0 ].shape[0]])
table.add_row(["Viviani 2018", viviani2018.shape[0], \
               viviani2018[viviani2018['label'] == 1].shape[0], \
               viviani2018[viviani2018['label'] == 0 ].shape[0]])
table.add_row(["SATD", satd.shape[0], \
               satd[satd['label'] == 1].shape[0], \
               satd[satd['label'] == 0 ].shape[0]])
table.add_row(["Stack overflow", so.shape[0], \
               so[so['label'] == 1].shape[0], \
               so[so['label'] == 0 ].shape[0]])
print(table)

+----------------+-----------------+-------------+--------------+
|  Dataset Name  | Total # of rows | # of design | # of general |
+----------------+-----------------+-------------+--------------+
|  Brunet 2014   |       159       |      61     |      98      |
|  Shakiba 2016  |        67       |      25     |      42      |
|  Viviani 2018  |       159       |      61     |      98      |
|      SATD      |       2609      |     558     |     2051     |
| Stack overflow |      46744      |    23879    |    22865     |
+----------------+-----------------+-------------+--------------+
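
The per-dataset counts in the table above could also be gathered in a loop instead of one `add_row` call per dataset. The sketch below assumes, as in the cell above, that each dataset has a `label` column holding 1 for design and 0 for general. The helper `class_counts` and the toy data frames are made up for the example.

```python
import pandas as pd

def class_counts(datasets, label_column="label"):
    """Return (name, total rows, # design, # general) for each dataset."""
    rows = []
    for name, df in datasets.items():
        labels = df[label_column]
        rows.append((name, len(df), int((labels == 1).sum()), int((labels == 0).sum())))
    return rows

# Hypothetical toy datasets standing in for the cross data frames.
toy_datasets = {
    "toy A": pd.DataFrame({"text": ["a", "b", "c"], "label": [1, 0, 0]}),
    "toy B": pd.DataFrame({"text": ["d", "e"], "label": [1, 1]}),
}
rows = class_counts(toy_datasets)
for row in rows:
    print(row)
```

Each tuple can be passed straight to `table.add_row(list(row))` to build the same kind of table.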


### Vectorize and Save the data

In [84]:
X_brunet2014 = get_text_vector(brunet2014['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_brunet2014.npy")
Y_brunet2014 = get_label_vector(brunet2014['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/Y_brunet2014.npy")
print(X_brunet2014.shape)
print(Y_brunet2014.shape)

(159, 300)
(159,)


In [79]:
X_shakiba2016 = get_text_vector(shakiba2016['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_shakiba2016.npy")
Y_shakiba2016 = get_label_vector(shakiba2016['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/Y_shakiba2016.npy")
print(X_shakiba2016.shape)
print(Y_shakiba2016.shape)

(67, 300)
(67,)


In [81]:
X_viviani2018 = get_text_vector(viviani2018['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_viviani2018.npy")
Y_viviani2018 = get_label_vector(viviani2018['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/Y_viviani2018.npy")
print(X_viviani2018.shape)
print(Y_viviani2018.shape)

(159, 300)
(159,)


In [82]:
X_satd = get_text_vector(satd['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_satd.npy")
Y_satd = get_label_vector(satd['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/Y_satd.npy")
print(X_satd.shape)
print(Y_satd.shape)

(2609, 300)
(2609,)


In [83]:
X_so = get_text_vector(so['text'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/X_so.npy")
Y_so = get_label_vector(so['label'], "/content/drive/My Drive/documents/projects/DeCaf/data/cross_data/processed/Y_so.npy")
print(X_so.shape)
print(Y_so.shape)

(46744, 300)
(46744,)
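
Since the five cells above differ only in the dataset name, the same work could be done in one loop. The sketch below is one possible arrangement, not the code that was run: `vectorize_all` is a hypothetical helper that takes the two conversion functions as arguments, and the stubs used in the demonstration only report what they were given so that the example runs without the model or the Drive files.

```python
import os

def vectorize_all(frames, base_dir, text_fn, label_fn):
    """Apply text_fn and label_fn to every dataset, using X_<name>.npy / Y_<name>.npy paths."""
    vectors = {}
    for name, frame in frames.items():
        x_path = os.path.join(base_dir, f"X_{name}.npy")
        y_path = os.path.join(base_dir, f"Y_{name}.npy")
        vectors[name] = (text_fn(frame["text"], x_path), label_fn(frame["label"], y_path))
    return vectors

# Stand-ins for get_text_vector / get_label_vector: they return the input size and file name.
def stub(data, path):
    return (len(data), os.path.basename(path))

# A plain dict stands in for a DataFrame here; both support frame["text"].
toy_frames = {"toy": {"text": ["first document", "second document"], "label": [1, 0]}}
result = vectorize_all(toy_frames, "cross_data", stub, stub)
print(result)
```

In the notebook, `frames` would be a dictionary such as `{"brunet2014": brunet2014, "satd": satd}` and the two functions would be `get_text_vector` and `get_label_vector`.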
