<a href="https://colab.research.google.com/github/alvi2496/DeCaf/blob/master/DeCaf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook for the experiment of building **DeCaf** (**De**sign **C**l**a**ssi**f**ier)

### Objective
The main objective is to classify discussions from pull requests, issue trackers, commit messages, and code comments as `design` or `general`. We also want the classifier to work across projects.

### Data Collection
- We intend to use three types of data. The first is used to train the `Word Embedding` model. Because `Word Embedding` requires well-structured text, we used sentences from literature (e.g., papers and books). We restricted our choice of papers and books to the Software Engineering domain to keep the context of the `Word Embedding` model within Software Engineering. The `Word Embedding` model will be used to vectorize our training data.
- We collected our training data from Stack Overflow questions, answers, and comments, and classified each item as `design` or `general` based on its tags. For example, questions and answers tagged `design-patterns` or `software-design` fall under the `design` class, while data tagged `javascript` or `django` is classified as `general`.
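
The tag-based labeling described above can be sketched roughly as follows. This is only an illustration: the function name `label_from_tags` and the constant `DESIGN_TAGS` are hypothetical, the tag set contains just the two design tags named in the text, and treating a post with mixed tags as `design` is an assumption.

```python
# Hypothetical tag set: the text names only these two design tags as examples.
DESIGN_TAGS = {"design-patterns", "software-design"}

def label_from_tags(tags):
    """Return 'design' if any tag is a design tag, otherwise 'general'."""
    # Assumption: a single design tag is enough to label the post 'design'.
    return "design" if DESIGN_TAGS.intersection(tags) else "general"

print(label_from_tags(["java", "design-patterns"]))
print(label_from_tags(["javascript", "django"]))
```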

### Data Cleaning
Raw data can contain a lot of noise. Especially when scraped from documents or websites, it can include irrelevant content in the form of names, punctuation, numbers (e.g., years), and misspelled or incomplete words. It can also contain many stopwords that can confuse the model. We removed all of these to make our data as clean as possible. After cleaning, our data contains only words that are not stopwords, are present in the English dictionary, and have a length greater than three.
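
A minimal sketch of such a cleaning step is shown below. The notebook does not say which stopword list or English dictionary was used, so `STOPWORDS` and `DICTIONARY` are tiny hypothetical stand-ins, and `clean_text` is an illustrative name rather than the function used in the experiment.

```python
import re

# Tiny stand-ins for illustration only; the real stopword list and
# English dictionary used in the experiment are not given in the text.
STOPWORDS = {"this", "that", "with", "from", "have", "will"}
DICTIONARY = {"this", "with", "have", "pattern", "factory", "design", "method", "class"}

def clean_text(text, stopwords=STOPWORDS, dictionary=DICTIONARY, min_len=4):
    """Keep lowercase alphabetic tokens that are dictionary words,
    are not stopwords, and have at least `min_len` characters."""
    # Extracting only runs of letters drops digits and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(
        t for t in tokens
        if len(t) >= min_len and t not in stopwords and t in dictionary
    )

print(clean_text("In 2019, this Factory-Method pattern (GoF) was redesignd with a class!"))
# prints: factory method pattern class
```

Here `min_len=4` encodes the "length greater than three" rule; the misspelled token `redesignd` is dropped because it is not in the stand-in dictionary.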

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Assign Data location
- literature: holds the data for word embedding

In [0]:
literature_file = "/content/drive/My Drive/documents/projects/DeCaf/data/literature.txt"
we_model_file = "/content/drive/My Drive/documents/projects/DeCaf/model/we.bin"

### Create Word Embedding
- Use fastText for the word embedding.
1. We take `literature.txt` as our input data.
2. We train the model unsupervised, since we only want to group the data according to similarity.
3. We use skipgram because we have observed that skipgram models work better with subword information than cbow.
4. We use character n-grams (subwords) of 4 to 25 characters (`minn=4`, `maxn=25`). Since cleaning keeps only words longer than three characters, subwords shorter than 4 characters are not important. Design words also seem to be on the longer side; for example, `reproducibility` contains 15 characters.
5. We use 300 dimensions for the word vectors and train for 10 epochs, both because the training corpus is relatively small.
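
To make the subword range in step 4 concrete, here is a simplified sketch of character n-gram extraction. It is not fastText's internal implementation (which, among other things, hashes n-grams into buckets); the helper name `char_ngrams` is hypothetical. It only illustrates which substrings fall inside a given `minn`/`maxn` range once the word is wrapped in the `<` and `>` boundary markers that fastText uses.

```python
def char_ngrams(word, minn=4, maxn=25):
    """List the character n-grams of `word` with lengths in [minn, maxn],
    after wrapping the word in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)  # empty when n > len(wrapped)
    ]

# A narrow range keeps the output short for the demo.
print(char_ngrams("design", minn=4, maxn=5))
# prints: ['<des', 'desi', 'esig', 'sign', 'ign>', '<desi', 'desig', 'esign', 'sign>']
```

With the notebook's settings (`minn=4`, `maxn=25`), a long word is also covered by n-grams that span most or all of it, which fits the reasoning that design terms tend to be long.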

In [3]:
!pip install fasttext



### Train the `Word Embedding` model and save it as we.bin
- Train and save the model if it is not already on disk
- Load the model from disk if present

In [4]:
import os
from datetime import datetime as dt
import fasttext as ft

if not os.path.exists(we_model_file):
  print(str(dt.now())+' Training Word Embedding model...')
  model = ft.train_unsupervised(literature_file, "skipgram", minn=4, maxn=25, dim=300, epoch=10)
  model.save_model(we_model_file)
  print(str(dt.now())+' Model trained and saved successfully')
else:
  print(str(dt.now())+' Loading Word Embedding model from disk...')
  model = ft.load_model(we_model_file)
  print(str(dt.now())+' Model loaded successfully.')

2020-03-19 10:39:53.027436 Training Word Embedding model...
2020-03-19 11:01:18.792814 Model trained and saved successfully


### Playing around with the model

Get the dimension of the model

In [5]:
model.get_dimension()

300

Get the words of the model (for a demo)

In [6]:
model.get_words()

['software',
 'design',
 'system',
 'data',
 'development',
 'process',
 'code',
 'model',
 'used',
 'systems',
 'test',
 'using',
 'information',
 'engineering',
 'project',
 'analysis',
 'work',
 'time',
 'when',
 'different',
 'testing',
 'will',
 'example',
 'quality',
 'approach',
 'number',
 'based',
 'case',
 'requirements',
 'security',
 'figure',
 'class',
 'method',
 'program',
 'research',
 'application',
 'results',
 'service',
 'first',
 'user',
 'models',
 'methods',
 'study',
 'developers',
 'however',
 'team',
 'knowledge',
 'level',
 'state',
 'components',
 'section',
 'services',
 'table',
 'component',
 'projects',
 'problem',
 'tools',
 'management',
 'source',
 'techniques',
 'university',
 'architecture',
 'context',
 'support',
 'agile',
 'cases',
 'object',
 'need',
 'order',
 'type',
 'language',
 'users',
 'tool',
 'implementation',
 'product',
 'computer',
 'interface',
 'applications',
 'performance',
 'change',
 'three',
 'provide',
 'specific',
 'programm

Get the vector of a word

In [7]:
model.get_word_vector('Maintainability')

array([ 0.49718451,  0.2315044 ,  0.00726337, -0.11509292, -0.30824506,
        0.19385934, -0.05463782,  0.07306018, -0.41316083, -0.1308918 ,
       -0.1589214 ,  0.38056305, -0.12928438, -0.02915404,  0.38863796,
        0.06197078, -0.39293942,  0.32120928,  0.20311803,  0.02899477,
        0.1272918 ,  0.09957283,  0.1525853 ,  0.06063541, -0.3255356 ,
       -0.17407925,  0.04720291,  0.06925295, -0.25326678, -0.11411815,
        0.18991517, -0.07279769, -0.22565736,  0.05999156,  0.17503096,
       -0.13788792, -0.36241674, -0.03940193, -0.04560324, -0.46844527,
       -0.08136979, -0.01617863, -0.17989476, -0.09669915,  0.05645262,
       -0.19524406,  0.20046097,  0.08503171, -0.07665228, -0.08753125,
       -0.13431793, -0.03048689, -0.2516213 , -0.15284824,  0.31909558,
       -0.02277226, -0.27882984, -0.02941976, -0.01645686, -0.4492818 ,
       -0.04709399, -0.13244438,  0.00337953,  0.10046361,  0.0887405 ,
        0.2825424 , -0.11192344, -0.04841837, -0.02811774,  0.07

Get the vector of a sentence

In [8]:
model.get_sentence_vector("Add performance tracker to active admin jobs")

array([ 9.06293979e-04,  3.29518318e-02,  4.04947018e-03,  1.28106903e-02,
        3.78645863e-03,  7.73546696e-02,  1.71451345e-02, -2.49670707e-02,
        2.26156805e-02,  2.69182846e-02,  2.24937219e-02, -1.18468364e-03,
       -3.65603641e-02, -2.42458819e-03,  2.23642960e-02,  1.19834952e-02,
        2.24862825e-02,  1.27166035e-02, -2.83165220e-02, -1.30431820e-02,
        3.29821669e-02,  1.18801491e-02,  1.50429215e-02,  6.11085491e-03,
        2.36101821e-02, -6.99625257e-03, -2.06737034e-02,  2.87541691e-02,
        1.66713446e-02, -3.99720185e-02,  2.14984249e-02, -3.82958949e-02,
        2.27379170e-03,  4.41979952e-02,  4.70482968e-02, -1.96705980e-04,
       -4.24090400e-02, -5.00096008e-04,  1.82799343e-02, -5.17343245e-02,
       -2.00461410e-02, -1.52526740e-02, -2.25980822e-02,  3.83293652e-03,
        3.60641927e-02,  6.92354608e-03, -7.30336644e-03, -1.78665891e-02,
        5.05065732e-03,  1.36671318e-02,  1.28546180e-02, -5.15987948e-02,
       -1.13818143e-02, -

### Read train data
We use three types of data for training the model. We scraped question, answer, and comment data from Stack Overflow and tagged each item `design` or `general`. The tagging was done automatically based on the original tags of the question. We then processed the data to remove noise, as we did for the word embedding data. We also applied one additional step to the training data: we kept only documents that contain more than 10 words and discarded the rest. After processing, we had 100,000 documents (50,000 design, 50,000 general) in `questions.csv`, 40,000 documents (20,000 design, 20,000 general) in `answers.csv`, and 60,000 documents (30,000 design, 30,000 general) in `comments.csv`.

Our training data has been cleaned of noise and stopwords, and every document contains more than 10 words. All the data is in random order.
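
The length filter described above can be sketched in plain Python as follows. The helper name `keep_long_documents` and the sample rows are made up for illustration; the actual preprocessing code is not part of this notebook.

```python
def keep_long_documents(rows, min_words=10):
    """Keep (text, label) pairs whose text has more than `min_words` words."""
    return [(text, label) for text, label in rows if len(text.split()) > min_words]

# Made-up sample rows: one with 12 words, one with 3 words.
rows = [
    (" ".join(["token"] * 12), "design"),
    (" ".join(["token"] * 3), "general"),
]

kept = keep_long_documents(rows)
print(len(kept))  # prints: 1
```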

#### Assign train data location


In [0]:
train_question_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/questions.csv"
train_answer_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/answers.csv"
train_comment_file = "/content/drive/My Drive/documents/projects/DeCaf/data/train_data/comments.csv"

#### Read the data with pandas

In [0]:
import pandas as pd
from prettytable import PrettyTable

train_question = pd.read_csv(train_question_file)
train_answer = pd.read_csv(train_answer_file)
train_comment = pd.read_csv(train_comment_file)

#### Explore the train data

In [13]:
print("Question train data")
print(train_question.head())
print("Answer train data")
print(train_answer.head())
print("Comment train data")
print(train_comment.head())

Question train data
                                            question    label
0  just showing test showing real real showing al...  general
1  making models display current users data creat...  general
2  design pattern equivalents when coming world d...   design
3  size limit datatype experimenting coming using...  general
4  refactoring else following code design pattern...   design
Answer train data
                                              answer    label
0  java comes simple powerful security idea princ...   design
1  stumbled across investigating patterns think m...   design
2  answer superfluous means caller knows type mea...   design
3  current need write rails code curious details ...  general
4  whether pattern explicit query methods depends...   design
Comment train data
                                             comment    label
0  mean obvious answer output stream current scri...  general
1  book defines factory patterns factory method f...   design
2  post snipp

In [14]:
table = PrettyTable()

table.field_names = ["Dataset Name", "Shape", "# of design data", "# of general data"]

qd = train_question[train_question['label']=='design']
qg = train_question[train_question['label']=='general']

ad = train_answer[train_answer['label']=='design']
ag = train_answer[train_answer['label']=='general']

cd = train_comment[train_comment['label']=='design']
cg = train_comment[train_comment['label']=='general']

table.add_row(["Question", train_question.shape, qd.shape[0], qg.shape[0]])
table.add_row(["Answer", train_answer.shape, ad.shape[0], ag.shape[0]])
table.add_row(["Comment", train_comment.shape, cd.shape[0], cg.shape[0]])


print(table)

+--------------+-------------+------------------+-------------------+
| Dataset Name |    Shape    | # of design data | # of general data |
+--------------+-------------+------------------+-------------------+
|   Question   | (100000, 2) |      50000       |       50000       |
|    Answer    |  (40000, 2) |      20000       |       20000       |
|   Comment    |  (60000, 2) |      30000       |       30000       |
+--------------+-------------+------------------+-------------------+
