Finding similar repositories using documentation

This project is part of the 2024 Research Project at TU Delft.

Abstract

This repository contains the supporting code for a paper that studies the importance of considering the documentation side of GitHub repositories when assessing the similarity between two or more applications. README and Wiki files, along with comments extracted from the source files, are the three dimensions analyzed through our methodology and experiments. We propose a pipeline that first extracts text fragments from these dimensions and then applies Natural Language Processing techniques to prepare the data for evaluation. To obtain a similarity score, we vectorize the processed text with TF-IDF and compute the cosine distance between the resulting vectors. Combinations of the three dimensions, ranging from a single dimension to all of them, are considered throughout our study. In addition, further information is extracted from the plain text, such as referenced URLs and license usage, whose similarity is measured with the Jaccard distance.

Two experiments were performed. The first observes the behavioral tendencies of our methodology on a small dataset, while the second validates our results. Their evaluation provided sufficient evidence to support our conclusion: documentation is a valuable asset for gathering a pool of similar applications.
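
For illustration, the snippet below is a minimal sketch of the scoring idea described above, not the project's actual pipeline code: TF-IDF vectors compared with cosine similarity for the documentation text, and Jaccard similarity for the extracted URL and license sets (the corresponding distances are simply one minus these values). The function names and example inputs are hypothetical.

```python
# Minimal sketch of the scoring approach described in the abstract
# (illustrative only, not the repository's similarity_calculation.py).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def text_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the TF-IDF vectors of two documentation texts."""
    tfidf = TfidfVectorizer(stop_words="english")
    vectors = tfidf.fit_transform([doc_a, doc_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


def jaccard_similarity(set_a: set, set_b: set) -> float:
    """Jaccard similarity between two sets, e.g. extracted URLs or licenses."""
    if not set_a and not set_b:
        return 0.0  # degenerate case: nothing extracted on either side
    return len(set_a & set_b) / len(set_a | set_b)


print(text_similarity("A web crawler written in Java", "A Java-based web crawler"))
print(jaccard_similarity({"MIT"}, {"MIT", "Apache-2.0"}))
```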

Experimental setup

To reproduce our experiments, several scripts must be run, in both Java (14) and Python (3.10). The packages required by the Python scripts are listed in requirements.txt; to install them, run pip install -r requirements.txt.

First, run GetDocumentation.java (src/main/java) to extract the documentation locally. Next, run the Python scripts in the scripts/ folder in the following order: text_processing.py, similarity_calculation.py, crosssim_evaluation.py, and lastly docs_distribution.py.
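
As a rough guide, the sketch below runs these steps in order from the repository root. It assumes GetDocumentation has already been compiled to target/classes and that the Python scripts take no command-line arguments; the classpath, class name, and interpreter may differ in your setup.

```python
# Hypothetical driver that runs the pipeline steps in the order described above.
import subprocess

steps = [
    ["java", "-cp", "target/classes", "GetDocumentation"],  # extract documentation locally
    ["python", "scripts/text_processing.py"],               # NLP preprocessing of the text
    ["python", "scripts/similarity_calculation.py"],        # TF-IDF/cosine and Jaccard scores
    ["python", "scripts/crosssim_evaluation.py"],           # validation on the CrossSim dataset
    ["python", "scripts/docs_distribution.py"],             # documentation distribution plots
]

for cmd in steps:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort if any step fails
```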

Three scenarios for handling missing documentation were analyzed. To produce an accuracy plot covering all scenarios, uncomment lines 155-222, 252-297, and 72 in scripts/similarity_calculation.py.

Results

The results of our experiments, including plots, are available in the plots/ folder and its subfolders. Due to the size of the datasets, we include only the raw data in the repository, without heatmaps or cluster maps, in plots/processedDocumentation/ (note that the results saved there contain the labels assigned during our evaluation process) and plots/processedCrossSim. In addition, a manual evaluation of the CrossSim dataset is available in plots/processedDocumentation/manual_evaluation.csv.

Acknowledgment

I would like to thank the developers of CrossSim for making their evaluated repository dataset available, which we used during our validation experiment.
