coinse/cs453-demo-irfl

Information Retrieval (IR) Based Fault Localisation

We will use a tf-idf Vector Space Model (VSM) of the documents to measure the similarity between the bug report and every source code file. For the hands-on, we will skip the usual pre-processing stages and only filter out English stop words.
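To make the idea concrete, here is a minimal sketch of tf-idf vectorisation with English stop-word filtering. The three strings are hypothetical stand-ins for a bug report and two source files; the hands-on itself reads real files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical in-memory documents, used only for illustration.
docs = [
    "the parser crashes on an empty input",  # stands in for the bug report
    "class Parser parse input tokens",       # stands in for a source file
    "class Renderer draw the window",        # stands in for another source file
]

# stop_words="english" drops common English words such as "the" and "on".
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # one row per document, one column per term

# The terms that survived filtering.
vocabulary = set(vectorizer.vocabulary_)
```

Each row of `tfidf` is the vector for one document, so the bug report and the source files end up in the same vector space and can be compared directly.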

Dependencies

We will use scikit-learn to implement the vectorisation and the similarity measurement.

Instructions

The provided irfl.py file contains a skeleton for implementing the IRFL heuristic. For the tf-idf vectorisation, we will use TfidfVectorizer from scikit-learn (sklearn.feature_extraction.text.TfidfVectorizer); see its API documentation for details. Note that the vectorizer can read a list of filenames directly when it is constructed with input='filename'. This is why step 1 is to collect all filenames. Step 2 is to use TfidfVectorizer to obtain the vector representations.

  1. Collect all documents (i.e., the bug report and all source files)
  2. Compute tf-idf vectors of each document
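Steps 1 and 2 can be sketched as follows. This is only one possible shape, not the contents of irfl.py: the function names, the `*.java` glob pattern, and the choice to put the bug report first in the list are all assumptions made for illustration.

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer


def collect_documents(bug_report, source_dir, pattern="*.java"):
    """Step 1: return the bug report path followed by all source file paths.

    The "*.java" pattern is a hypothetical default; adjust it to the project.
    """
    sources = sorted(str(p) for p in Path(source_dir).rglob(pattern))
    return [str(bug_report)] + sources


def vectorise(filenames):
    """Step 2: compute one tf-idf row per file, in the order given."""
    vectorizer = TfidfVectorizer(
        input="filename",       # read each document from its path
        stop_words="english",   # the only pre-processing used in the hands-on
        decode_error="ignore",  # assumption: skip bytes that fail to decode
    )
    return vectorizer.fit_transform(filenames)
```

Keeping the bug report at index 0 means its vector is always row 0 of the resulting matrix, which makes it easy to pick out later.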

Given the resulting tf-idf matrix (i.e., one row vector per document), you can use the pairwise cosine_similarity function from scikit-learn (sklearn.metrics.pairwise.cosine_similarity); see its API documentation for details.

  3. Compute the cosine similarity between the bug report vector and each source file vector
  4. Rank the source files by their similarity to the bug report
  5. Report the top five files
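The remaining steps can be sketched as below, assuming the bug report is row 0 of the tf-idf matrix and that `filenames` lists the documents in the same order as the rows. The function name `rank_sources` and its parameters are hypothetical, not part of the provided skeleton.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def rank_sources(tfidf, filenames, top_n=5):
    """Rank source files by cosine similarity to the bug report (row 0)."""
    # Compare the bug report row against every source file row.
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    # Indices of the highest scores first, truncated to top_n.
    order = np.argsort(scores)[::-1][:top_n]
    # scores[i] belongs to filenames[i + 1] because row 0 is the bug report.
    return [(filenames[i + 1], float(scores[i])) for i in order]
```

Calling `rank_sources(tfidf, filenames)` returns up to five `(filename, score)` pairs, most similar first. Comparing only row 0 against the rest avoids computing similarities between pairs of source files, which the ranking does not need.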
