# DS4SE Tutorial

This quick tutorial uses the [DS4SE API](https://pypi.org/project/ds4se/) to:
1. Calculate the traceability value between one pair of artifacts.
2. For the source and target artifact classes in the Libest dataset, calculate:
  
  >1) the number of documents in each class

  >2) the vocabulary size of each class

  >3) the average number of tokens per document in each class

  >4) the top three most frequent tokens across the source and target artifact classes combined

  >5) the top three most frequent tokens in the source artifact class

  >6) the number of vocabulary tokens shared between the source and target artifact classes

  >7) the cross-entropy value of the source and target artifact classes



This is a quick introduction to the DS4SE API. To follow this tutorial in Google Colab, click the run (right arrow) button in each cell in sequence, or click Runtime -> Run all to run all the cells at once.

Download and install the libraries that DS4SE depends on:

In [None]:
!pip install --upgrade gensim
!pip install nbdev
!pip install sentencepiece
!pip install dit

Requirement already up-to-date: gensim in /usr/local/lib/python3.6/dist-packages (3.8.3)


Download and install DS4SE, then import its facade module:


In [None]:
!pip install ds4se



In [None]:
import ds4se.facade as facade

Import other libraries needed for this tutorial:

In [None]:
import pandas as pd
import numpy as np

Load and prepare the [Libest dataset](https://github.com/WM-SEMERU/ds4se/tree/master/nbs/test_data). Rename the column that stores the actual file contents to "contents".


In [None]:
!wget https://raw.githubusercontent.com/WM-SEMERU/ds4se/SE_Proj2_Facade/nbs/test_data/%5Blibest-pre-req%5D.csv
!wget https://raw.githubusercontent.com/WM-SEMERU/ds4se/SE_Proj2_Facade/nbs/test_data/%5Blibest-pre-tc%5D.csv

--2020-11-21 04:17:44--  https://raw.githubusercontent.com/WM-SEMERU/ds4se/SE_Proj2_Facade/nbs/test_data/%5Blibest-pre-req%5D.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60790 (59K) [text/plain]
Saving to: ‘[libest-pre-req].csv.1’


2020-11-21 04:17:44 (4.28 MB/s) - ‘[libest-pre-req].csv.1’ saved [60790/60790]

--2020-11-21 04:17:44--  https://raw.githubusercontent.com/WM-SEMERU/ds4se/SE_Proj2_Facade/nbs/test_data/%5Blibest-pre-tc%5D.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 343900 (336K) [text/plain]
Saving to: ‘[libest-p

In [None]:
source_file = pd.read_csv("[libest-pre-req].csv",names=['ids', 'text'], header=None, sep=' ')
target_file = pd.read_csv("[libest-pre-tc].csv",names=['ids', 'text'], header=None, sep=' ')
source_file = source_file.rename(columns={"text":"contents"})
target_file = target_file.rename(columns={"text":"contents"})

Create a pandas dataframe to store the result:

In [None]:
d = {'source': [], 'target': [], 'distance':[],'similarity/traceability':[]}
output_df = pd.DataFrame(data=d)

Retrieve one element from the source artifacts and one element from the target artifacts to calculate traceability. Store the ID information for reference.

In [None]:
source_id = source_file["ids"][1].split('/')[-1]
target_id = target_file["ids"][1].split('/')[-1]
source_string = source_file["contents"][1]
target_string = target_file["contents"][1]
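The loading and ID-extraction steps above can be tried on a small in-memory file before touching the real dataset. This is a minimal sketch: the two rows below are made-up stand-ins, and the layout (a path, a space, then quoted text) is an assumption about the CSV format rather than something taken from the Libest files.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for a Libest CSV: "<path> <quoted text>" per line.
raw = io.StringIO(
    'req/RQ1-pre.txt "the client shall connect to the server"\n'
    'req/RQ2-pre.txt "the server shall reply to the client"\n'
)

# Same call shape as the tutorial: no header row, space-separated, two named columns.
toy_file = pd.read_csv(raw, names=["ids", "text"], header=None, sep=" ")
toy_file = toy_file.rename(columns={"text": "contents"})

# Keep only the file name from the path stored in "ids".
toy_id = toy_file["ids"][1].split("/")[-1]
toy_string = toy_file["contents"][1]

print(list(toy_file.columns))
print(toy_id, "->", toy_string)
```

Label `1` selects the second row here, because `read_csv` assigns a default index that starts at 0.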

Call the TraceLinkValue method to calculate the distance and traceability values of this pair. This example uses the word2vec technique.

In [None]:
TLV = facade.TraceLinkValue(source_string, target_string, "word2vec")
distance = TLV[0]
traceability = TLV[1]

2020-11-21 04:17:45,247 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-11-21 04:17:45,256 : INFO : built Dictionary(1815 unique tokens: ['@return', 'Converts', 'The', 'a', 'and']...) from 153 documents (total 5769 corpus positions)
2020-11-21 04:17:45,257 : INFO : loading Word2Vec object from /usr/local/lib/python3.6/dist-packages/ds4se/model/word2vec_libest.model
2020-11-21 04:17:45,327 : INFO : loading wv recursively from /usr/local/lib/python3.6/dist-packages/ds4se/model/word2vec_libest.model.wv.* with mmap=None
2020-11-21 04:17:45,330 : INFO : setting ignored attribute vectors_norm to None
2020-11-21 04:17:45,332 : INFO : loading vocabulary recursively from /usr/local/lib/python3.6/dist-packages/ds4se/model/word2vec_libest.model.vocabulary.* with mmap=None
2020-11-21 04:17:45,332 : INFO : loading trainables recursively from /usr/local/lib/python3.6/dist-packages/ds4se/model/word2vec_libest.model.trainables.* with mmap=None
2020-11-21 04:17:45,333 : INFO : setti

Display the result:

In [None]:
print("The traceability value between artifacts {} and {} is {}".format(source_id,target_id,format(traceability,'.2f')))

The traceability value between artifacts RQ46-pre.txt and us3496.c is 0.71
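The tutorial creates `output_df` but does not show how to fill it. One possible way, sketched below, is to collect each pair's values in a list and build the frame once at the end. The distance is left as NaN because the tutorial output does not show it; in the notebook you would pass `source_id`, `target_id`, `distance`, and `traceability` instead of the literals.

```python
import pandas as pd

# Column names taken from the result table defined earlier in the tutorial.
columns = ["source", "target", "distance", "similarity/traceability"]

# Collect one dict per artifact pair; placeholder literals stand in for
# the notebook variables source_id, target_id, distance, traceability.
rows = []
rows.append(
    {
        "source": "RQ46-pre.txt",
        "target": "us3496.c",
        "distance": float("nan"),  # placeholder: value not shown in the tutorial
        "similarity/traceability": 0.71,
    }
)

# Build the result table in one step once all pairs are collected.
output_df = pd.DataFrame(rows, columns=columns)
print(output_df)
```

Building the frame once from a list tends to be simpler than growing a DataFrame row by row when many pairs are compared.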


Call the **NumDoc** method to count the number of documents in each artifact class:

In [None]:
num_docs = facade.NumDoc(source_file, target_file)

Display the number of documents:

In [None]:
print("The source artifacts contain {} documents; the target artifacts contain {} documents.".format(num_docs[0],num_docs[1]))

The source artifacts contain 52 documents; the target artifacts contain 21 documents.


Call the **VocabSize** method to count the vocabulary size of each artifact class:

In [None]:
vocab_size = facade.VocabSize(source_file,target_file)

Display the vocabulary size of each artifact class:

In [None]:
print("The source artifacts' vocab size is {}. The target artifacts' vocab size is {}.".format(vocab_size[0],vocab_size[1]))

The source artifacts' vocab size is 2349. The target artifacts' vocab size is 3168.
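To make "document count" and "vocabulary size" concrete, here is a minimal sketch on toy data. It assumes plain whitespace tokenization and a hypothetical helper `toy_vocab_size`; DS4SE installs `sentencepiece`, so its own tokenization, and therefore its numbers, will likely differ.

```python
import pandas as pd

def toy_vocab_size(df, column="contents"):
    """Count distinct whitespace-separated tokens across all documents."""
    vocab = set()
    for text in df[column]:
        vocab.update(str(text).split())
    return len(vocab)

# Made-up documents standing in for the source and target artifact classes.
toy_source = pd.DataFrame({"contents": ["the client sends a request", "the server replies"]})
toy_target = pd.DataFrame({"contents": ["send request to server", "check the reply"]})

# Number of documents is simply the number of rows in each frame.
print(len(toy_source), len(toy_target))
print(toy_vocab_size(toy_source), toy_vocab_size(toy_target))
```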


Compute the average number of tokens per document in each class, and the difference between the two, using the **AverageToken** method:


In [None]:
token = facade.AverageToken(source_file, target_file)

Display the result:

In [None]:
print("On average, each document in source artifact class contains {} tokens and each document in target artifact class contains {} tokens ".format(token[0],token[1]))

On average, each document in source artifact class contains 365.21153846153845 tokens and each document in target artifact class contains 4970.476190476191 tokens 
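The averages above print with many decimal places. The sketch below shows the same kind of calculation on toy data and formats the result to two decimals. The helper `toy_average_tokens` and the whitespace tokenization are assumptions for illustration, not the DS4SE implementation.

```python
import pandas as pd

def toy_average_tokens(df, column="contents"):
    """Mean number of whitespace-separated tokens per document."""
    counts = [len(str(text).split()) for text in df[column]]
    return sum(counts) / len(counts)

# Made-up documents standing in for the two artifact classes.
toy_source = pd.DataFrame({"contents": ["the client sends a request", "the server replies"]})
toy_target = pd.DataFrame({"contents": ["send request to server", "check the reply"]})

avg_source = toy_average_tokens(toy_source)
avg_target = toy_average_tokens(toy_target)
difference = abs(avg_source - avg_target)

# Two-decimal formatting keeps the printed values readable.
print("source: {:.2f}, target: {:.2f}, difference: {:.2f}".format(avg_source, avg_target, difference))
```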


To find the most frequent tokens across both the source and target artifacts, use the **VocabShared** method:

In [None]:
vocab_shared = facade.VocabShared(source_file, target_file)

Display the result:

In [None]:
print("The top three most frequent tokens used in the two artifact classes, with their counts and frequencies:")
vocab_shared

The top three most frequent tokens used in the two artifact classes, with their counts and frequencies:


{'1': [2903, 0.02353065144969239],
 '8': [2241, 0.01816472266578045],
 '▁': [53876, 0.43669906217830773]}
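The result maps each token to a `[count, frequency]` pair, where the frequency reads as the token's share of all tokens. A minimal sketch of how such a table could be built with `collections.Counter` follows; `toy_top_tokens` is a hypothetical helper, and whitespace splitting is assumed instead of the tokenizer DS4SE uses.

```python
from collections import Counter

def toy_top_tokens(texts, k=3):
    """Return the k most frequent tokens as {token: [count, relative frequency]}."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.values())
    return {tok: [n, n / total] for tok, n in counts.most_common(k)}

# Made-up documents for the two artifact classes.
source_texts = ["the client sends a request", "the server replies"]
target_texts = ["send the request to the server"]

# Pool both classes to get the tokens that dominate the combined corpus.
top = toy_top_tokens(source_texts + target_texts)
print(top)
```

Passing only `source_texts` gives the per-class variant of the same table.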

Use the **Vocab** method to find the most frequent tokens in the source artifact class only:

In [None]:
vocab = facade.Vocab(source_file)

Display the result:

In [None]:
print("The top three most frequent tokens used in the source artifact class, with their counts and frequencies:")
vocab

The top three most frequent tokens used in the source artifact class, with their counts and frequencies:


{'client': [291, 0.01532304775946501],
 'est': [281, 0.014796482544363119],
 '▁': [8912, 0.4692749196988047]}

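Use the **SharedVocabSize** method to count the tokens shared between the source and target artifact classes:

In [None]: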
sharedvocabsize = facade.SharedVocabSize(source_file, target_file)

Display the result:

In [None]:
print("The number of tokens shared between the source and target artifact classes is {}".format(sharedvocabsize))

The number of tokens shared between the source and target artifact classes is 5042
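Conceptually, a shared vocabulary can be read as the intersection of two token sets. The sketch below shows that reading on toy data, with hypothetical helper names and whitespace tokenization. Under this definition the result can never exceed the smaller vocabulary, whereas the value reported above is larger than either vocabulary size, so SharedVocabSize presumably uses a different tokenization or counting rule; treat this as a conceptual illustration only.

```python
def toy_vocab(texts):
    """Set of distinct whitespace-separated tokens in a list of documents."""
    return {tok for text in texts for tok in text.split()}

# Made-up documents for the two artifact classes.
source_texts = ["the client sends a request", "the server replies"]
target_texts = ["send the request to the server"]

source_vocab = toy_vocab(source_texts)
target_vocab = toy_vocab(target_texts)

# Tokens that occur in both classes.
shared = source_vocab & target_vocab
print(len(shared), sorted(shared))
```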


Use the **CrossEntropy** method to calculate the cross entropy of the source and target classes:

In [None]:
entropy = facade.CrossEntropy(source_file, target_file)

Display the result:

In [None]:
print("The cross entropy value of source and target artifacts is {}".format(format(entropy,".2f")))

The cross entropy value of source and target artifacts is 6.16
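For intuition, the cross entropy between two token distributions p and q is H(p, q) = -Σ p(x) log2 q(x), measured in bits. The sketch below computes it on toy data. The helper `toy_cross_entropy`, the whitespace tokenization, and the add-one smoothing (used so that tokens missing from the second class do not produce log(0)) are all assumptions; DS4SE may use a different tokenizer, smoothing, or estimator.

```python
import math
from collections import Counter

def toy_cross_entropy(p_texts, q_texts):
    """Cross entropy H(p, q) in bits between token distributions of two corpora."""
    p_counts = Counter(tok for text in p_texts for tok in text.split())
    q_counts = Counter(tok for text in q_texts for tok in text.split())
    vocab = set(p_counts) | set(q_counts)

    p_total = sum(p_counts.values())
    # Add-one smoothing over the joint vocabulary keeps every q(x) above zero.
    q_total = sum(q_counts.values()) + len(vocab)

    h = 0.0
    for tok, n in p_counts.items():
        p = n / p_total
        q = (q_counts[tok] + 1) / q_total
        h -= p * math.log2(q)
    return h

# Made-up documents for the two artifact classes.
source_texts = ["the client sends a request", "the server replies"]
target_texts = ["send the request to the server"]

print("{:.2f}".format(toy_cross_entropy(source_texts, target_texts)))
```

A higher value indicates that the second class's token distribution is a poorer model of the first class's tokens.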
