In [None]:
#hide
import ds4se.facade as facade
import pandas as pd

# ds4se

> Data Science for Software Engineering (ds4se) is an academic initiative to perform exploratory analysis on software engineering artifacts (e.g., requirements, issues, source code, or test cases) and metadata (e.g., repository logs, database logs, or binaries). This EDA involves the following components:

 - 1. Data Management {mgmnt}
     - Pandas (by default): all our data must be handled using pandas dataframes
         - DF_REQ
         - DF_CODE
         - DF_TC
     - Data Processing:
         - Transformations (e.g., data cleaning)
         - Tokenizers (e.g., nltk or BPEs)
 - 2. Data Representation {repr}
     - 2.1 Canonical Representation
         - Bag-of-words (e.g., TF-IDF)
         - LDAs
         - LSI
     - 2.2 Neural Representation
         - Text Based Representation
             - Language Models
                 - Doc2vec
                 - Word2vec
                 - Transformer-Based
                     - CodeBert
                 - RNN-Based
             - Structured Models
                 - Code2vec
                 - Athena
         - Video Based Representation (?)
 - 3. Data Analysis {eda}
     - 3.1 Descriptive Statistics
         - Software Metrics (e.g., coupling, cyclo, lcom5)
     - 3.2 Information Science 
         - Information Metrics (e.g, joint entropy, self-information, or mutual information)
     - 3.3 Inference Data Analysis: Finding patterns or using them in evaluations
         - Bayesian Inference
         - Causal Inference
 - 4. Data Mining and Information Retrieval {mining}
     - 4.1 Data Mining
     - 4.2 Information Retrieval {mining.ir}
         
Related Libraries:
     - [DVC](https://dvc.org/doc/start)
     - [CML](https://cml.dev/): Continuous Machine Learning
     - mlprog_templante


## Install

`pip install ds4se`

## How to use

In [None]:
import ds4se.facade as facade

## Traceability

Use the ds4se library to calculate the trace link value of a proposed trace link between a source artifact and a target artifact, using a given technique.

    Supported technique models:
        VSM
        LDA
        orthogonal 
        LSA
        JS
        word2vec
        doc2vec

The function returns a tuple of two numbers: the first element is the distance between the two artifacts, and the second is their similarity, which is the traceability value.

In [None]:
facade.TraceLinkValue("source_string","target_string","LDA")

0.01

`word2vec_metric` is an optional parameter when using word2vec as the technique. The available metrics are:

 - WMD
 - SCM
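
To make the (distance, similarity) pair concrete, here is a minimal sketch of a VSM-style comparison using raw term counts and cosine similarity. It is not the ds4se implementation: the function name `cosine_trace_link`, the whitespace tokenizer, and the choice of `1 - similarity` as the distance are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Tuple


def cosine_trace_link(source: str, target: str) -> Tuple[float, float]:
    """Return (distance, similarity) for two artifacts (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    src_counts = Counter(source.lower().split())
    tgt_counts = Counter(target.lower().split())

    # Dot product over the source terms; absent target terms count as zero.
    dot = sum(count * tgt_counts[term] for term, count in src_counts.items())
    norm_src = math.sqrt(sum(c * c for c in src_counts.values()))
    norm_tgt = math.sqrt(sum(c * c for c in tgt_counts.values()))

    # An empty artifact has no direction, so treat it as maximally distant.
    if norm_src == 0 or norm_tgt == 0:
        return 1.0, 0.0

    similarity = dot / (norm_src * norm_tgt)
    return 1.0 - similarity, similarity


distance, similarity = cosine_trace_link("open the file", "close the file")
print(distance, similarity)
```

Two of the three terms are shared in this example, so the similarity is 2/3 and the distance is 1/3.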

## Analysis

### Usage of ds4se model to calculate the number of documents of either source or target class

    The method takes two parameters, a pandas dataframe of source artifacts and a pandas dataframe of target artifacts, and computes the count for both classes.
    
    The method returns a list of 4 integers:
    1: number of documents for source artifacts;
    2: number of documents for target artifacts;
    3: source difference (source count minus target count);
    4: target difference (target count minus source count).

In [None]:
result = facade.NumDoc(source_df, target_df)
source_doc = result[0]
target_doc = result[1]
difference_source = result[2]
difference_target = result[3]
print("The number of documents for source is {} , with {} source difference".format(source_doc, difference_source))
print("The number of documents for target is {} , with {} target difference".format(target_doc, difference_target))

The number of documents for source is 160 , with 32 source difference
The number of documents for target is 128 , with -32 target difference
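
The four returned values can be reproduced with a short sketch. This assumes each document is one string (as in the `contents` column) and uses plain lists instead of dataframes; `num_doc_sketch` is a hypothetical name, not part of ds4se.

```python
from typing import List, Sequence


def num_doc_sketch(source_docs: Sequence[str], target_docs: Sequence[str]) -> List[int]:
    """Return [n_source, n_target, source difference, target difference]."""
    n_source = len(source_docs)
    n_target = len(target_docs)
    # The two differences are negations of each other.
    return [n_source, n_target, n_source - n_target, n_target - n_source]


print(num_doc_sketch(["req 1", "req 2", "req 3"], ["code 1", "code 2"]))
# prints [3, 2, 1, -1]
```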


For all functions in the analysis part, the input should be a pandas dataframe with the following structure:

In [None]:
d = {'contents': ["hello world", "this is a content of another file"]}
df = pd.DataFrame(data=d)
df

Unnamed: 0,contents
0,hello world
1,this is a content of another file


### Usage of ds4se model to calculate the vocabulary size of either source or target class

    The method takes two parameters, source artifacts and target artifacts, and computes the vocabulary size for both classes.
    
    The method returns a list of 4 integers:
    1: vocabulary size for source artifacts;
    2: vocabulary size for target artifacts;
    3: source difference;
    4: target difference.

In [None]:
vocab_result = facade.VocabSize(source_df, target_df)
source = vocab_result[0]
target = vocab_result[1]
difference_source = vocab_result[2]
difference_target = vocab_result[3]
print("The vocabulary size for source is {} , with {} source difference".format(source, difference_source))
print("The vocabulary size for target is {} , with {} target difference".format(target, difference_target))

The vocabulary size for source is 179 , with 35 source difference
The vocabulary size for target is 144 , with -35 target difference
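
As an illustration of what a vocabulary size measures, the sketch below counts the distinct tokens in each class. It assumes lowercase whitespace tokenization, which may differ from the tokenizer ds4se uses; `vocab_size_sketch` is a hypothetical name.

```python
from typing import List, Sequence, Set


def vocabulary(docs: Sequence[str]) -> Set[str]:
    """Collect the distinct tokens across all documents of one class."""
    return {token for doc in docs for token in doc.lower().split()}


def vocab_size_sketch(source_docs: Sequence[str], target_docs: Sequence[str]) -> List[int]:
    """Return [source size, target size, source difference, target difference]."""
    n_source = len(vocabulary(source_docs))
    n_target = len(vocabulary(target_docs))
    return [n_source, n_target, n_source - n_target, n_target - n_source]


print(vocab_size_sketch(["hello world", "hello again"],
                        ["this is a content of another file"]))
# prints [3, 7, -4, 4]
```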


### Usage of ds4se model to calculate the average number of tokens of either source or target class

    The method takes two parameters, source artifacts and target artifacts, and computes the average for both classes.
    
    The method returns a list of 4 integers:
    1: average number of tokens for source artifacts;
    2: average number of tokens for target artifacts;
    3: source difference;
    4: target difference.

In [None]:
token_result = facade.AverageToken(source_df, target_df)
source = token_result[0]
target = token_result[1]
difference_source = token_result[2]
difference_target = token_result[3]
print("The number of average token for source is {} , with {} source difference".format(source, difference_source))
print("The number of average token for target is {} , with {} target difference".format(target, difference_target))

The number of average token for source is 107 , with 35 source difference
The number of average token for target is 143 , with -35 target difference
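
A minimal sketch of the average-token computation follows, assuming whitespace tokenization and one string per document. Unlike the library call described above, this sketch keeps the averages as floats instead of integers; `average_token_sketch` is a hypothetical name.

```python
from typing import List, Sequence


def mean_tokens(docs: Sequence[str]) -> float:
    """Average number of whitespace-separated tokens per document."""
    if not docs:
        return 0.0
    return sum(len(doc.split()) for doc in docs) / len(docs)


def average_token_sketch(source_docs: Sequence[str], target_docs: Sequence[str]) -> List[float]:
    """Return [source average, target average, source difference, target difference]."""
    avg_source = mean_tokens(source_docs)
    avg_target = mean_tokens(target_docs)
    return [avg_source, avg_target, avg_source - avg_target, avg_target - avg_source]


print(average_token_sketch(["hello world", "hello again friend"], ["a b c d"]))
# prints [2.5, 4.0, -1.5, 1.5]
```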


### Usage of ds4se model to retrieve term frequency

    The method takes two parameters, 
    1: source artifacts,
    2: target artifacts, 
    and computes the term frequencies over both classes.
    
    The method returns a dictionary with 
    key: token
    value: a list of count and frequency

In [None]:
facade.VocabShared(source_df,target_df)

{'est': [160, 0.16], 'http': [136, 0.136], 'frequnecy': [124, 0.124]}

If we only need the term frequency of one of the two classes, we can use the `Vocab()` function.

**The filename should be the path to the file**

In [None]:
facade.Vocab(artifacts_df)

{'est': [141, 0.141], 'http': [136, 0.136], 'frequnecy': [156, 0.156]}
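
The `token: [count, frequency]` shape of these dictionaries can be sketched as follows. This assumes that frequency means the token count divided by the total number of tokens, and that only the most common tokens are reported. Both are guesses based on the sample output, and `vocab_sketch` and `top_n` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Sequence


def vocab_sketch(docs: Sequence[str], top_n: int = 3) -> Dict[str, List[float]]:
    """Map each of the top_n tokens to [count, relative frequency]."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    total = sum(counts.values())
    # most_common() returns nothing for an empty corpus, so total is never 0 here.
    return {token: [count, count / total] for token, count in counts.most_common(top_n)}


print(vocab_sketch(["a b a", "a c"], top_n=2))
# prints {'a': [3, 0.6], 'b': [1, 0.2]}
```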

### For Shared Metrics

To compute a metric over both source and target artifacts, use the following functions.

They all take two parameters, the source and target artifacts, and return one int value.

Shared vocabulary size

In [None]:
facade.SharedVocabSize(source_df, target_df)

112

Mutual information

In [None]:
facade.MutualInformation(source_df, target_df)

127

Cross Entropy

In [None]:
facade.CrossEntropy(source_df, target_df)

171

KL Divergence

In [None]:
facade.KLDivergence(source_df, target_df)

152
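
For intuition about the shared metrics, the sketch below computes a shared vocabulary size, a cross entropy, and a KL divergence from token distributions. It is an assumption-laden illustration, not the ds4se implementation: it uses whitespace tokenization, add-one smoothing over the joint vocabulary so that no probability is zero, and base-2 logarithms. All function names are hypothetical, and the results are floats where the library is described as returning int values.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tokens(docs: Sequence[str]) -> List[str]:
    """Flatten documents into a list of lowercase whitespace tokens."""
    return [token for doc in docs for token in doc.lower().split()]


def token_distribution(docs: Sequence[str], vocab: Sequence[str],
                       alpha: float = 1.0) -> Dict[str, float]:
    """Smoothed token probabilities over a fixed vocabulary (assumed add-alpha)."""
    counts = Counter(tokens(docs))
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def shared_metrics_sketch(source_docs: Sequence[str],
                          target_docs: Sequence[str]) -> Dict[str, float]:
    source_vocab = set(tokens(source_docs))
    target_vocab = set(tokens(target_docs))
    vocab = sorted(source_vocab | target_vocab)
    p = token_distribution(source_docs, vocab)  # source distribution
    q = token_distribution(target_docs, vocab)  # target distribution
    return {
        "shared_vocab_size": len(source_vocab & target_vocab),
        # H(p, q) = -sum_t p(t) * log2 q(t)
        "cross_entropy": -sum(p[t] * math.log2(q[t]) for t in vocab),
        # KL(p || q) = sum_t p(t) * log2(p(t) / q(t))
        "kl_divergence": sum(p[t] * math.log2(p[t] / q[t]) for t in vocab),
    }


metrics = shared_metrics_sketch(["open the file"], ["close the file"])
print(metrics)
```

With these definitions the cross entropy equals the entropy of the source distribution plus the KL divergence, so it is never smaller than the KL divergence, and the KL divergence is zero when both classes have identical token distributions.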