This repository contains an analysis of issues found in GitHub Actions workflow files across different repositories. The pipeline consists of the following steps:
- Run ActionLint on the workflow files.
- Run the script to count the total number of lines in the workflows.
- Keep only the files that are valid workflow files.
- Merge the dataset with the ActionLint results on the `file_hash` column (a sketch of this step follows the list).
- Create new features such as `status`, `next_commit`, `error_count`, and `unique_workflow_id`.
- Perform resampling to obtain a normalized dataset.
- Map the custom Issue Rule to the dataset.
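A minimal sketch of the merge step, assuming both datasets live in CSV files (the file names here are hypothetical) and carry the columns listed below:

```python
import pandas as pd

# Hypothetical file names: the mined commit history and the ActionLint findings.
commits = pd.read_csv("workflow_commits.csv")
lint = pd.read_csv("actionlint_results.csv")

# Keep only valid workflow files (assuming 'valid_workflow' is a boolean column),
# then join the two datasets on file_hash.
valid = commits[commits["valid_workflow"]]
merged = valid.merge(lint, on="file_hash")
```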
The dataset comprises the following features:
- Message: Error message in the workflow file reported by ActionLint.
- Line: Line of the error message in the workflow file.
- Column: Column of the error message in the workflow file.
- Kind: Error messages are divided into various kinds, such as `deprecated-commands`, `expression`, `runner-label`, `events`, `action`, `syntax-check`, `workflow-call`, `matrix`, `glob`, `job-needs`, `id`, `env-var`, `shell-name`, and `yaml-syntax`, depending on the text of the error message.
- Snippet: Information about the code snippet where the error occurs in the workflow file.
- End Column: End column of the error message in the workflow file.
- File Hash: Hash of the workflow file.
- Repository: Repository containing the workflow file.
- Commit Hash: Hash of the commit of the workflow file in the repository.
- Author Name: Name of the author of the commit.
- Author Email: Email of the author of the commit.
- Committer Name: Name of the committer.
- Committer Email: Email of the committer.
- Committed Date: Date and time when the commit was applied to the repository.
- Authored Date: Date and time when the changes were originally created by the contributor.
- File Path: File path of the workflow in the repository.
- Previous File Path: Previous file path of the workflow file in the repository.
- Previous File Hash: Hash of the previous workflow file in the repository.
- Change Type: The kind of change made to the workflow file, such as Added (A), Modified (M), or Renamed (R).
- Valid YAML: Indicates that the file is valid YAML, though not necessarily a valid workflow file.
- Valid Workflow: Indicates that the file is a valid workflow file.
- Lines Count: Total number of lines in the workflow.
- Time Lapse: Time difference between the first commit containing the error message and the specified commit in the repository.
- Status: Whether the issue is open or closed in the workflow file at the specified commit.
- Error Count: Total number of issues in the workflow at the specified commit.
- Unique Workflow ID: A unique identifier for the workflow, built as the combination `{author}/{repo}/{filepath}/{hash_val}` (see the example below).
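A toy construction of the ID; every value below is a placeholder:

```python
author, repo = "octocat", "hello-world"
filepath = ".github/workflows/ci.yml"
hash_val = "3f2a9b7"  # placeholder commit hash

unique_workflow_id = f"{author}/{repo}/{filepath}/{hash_val}"
# -> "octocat/hello-world/.github/workflows/ci.yml/3f2a9b7"
```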
Here's a glimpse of the pseudocode for extracting essential features:
```
function error_count(repository, dataframe):
    count = 0
    msgstack = []
    for each row in dataframe:
        if row['repository'] equals repository then
            if row['message'] is in msgstack then
                Remove row['message'] from msgstack
                count = count - 1
            else:
                Add row['message'] to msgstack
                count = count + 1
            end if
            Update row['error_count'] in dataframe to count
        end if
    end for
    return dataframe
```
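A minimal pandas sketch of `error_count`, assuming the dataset is a `DataFrame` with the column names used in the pseudocode:

```python
import pandas as pd

def error_count(repository: str, dataframe: pd.DataFrame) -> pd.DataFrame:
    """Maintain a running count of open issues for one repository."""
    count = 0
    msgstack = []
    for idx, row in dataframe.iterrows():
        if row["repository"] != repository:
            continue
        if row["message"] in msgstack:
            # The message is already open: its reappearance closes it.
            msgstack.remove(row["message"])
            count -= 1
        else:
            # First sighting of this message: a new open issue.
            msgstack.append(row["message"])
            count += 1
        dataframe.at[idx, "error_count"] = count
    return dataframe
```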
```
function time_lapse(repository, dataset):
    for each row in dataset:
        if row['repository'] equals repository then
            Convert row['committed_date'] to a datetime object
            Add the converted row['committed_date'] back to dataset
        end if
    end for
    Sort dataset by 'committed_date' in ascending order
    Set start_date to the minimum 'committed_date' value in dataset
    for each row in dataset:
        Calculate the time difference between row['committed_date'] and start_date
        Set row['time_lapse'] to the calculated time difference
    end for
    return dataset
```
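A corresponding sketch of `time_lapse`; since only the minimum `committed_date` is needed as the start date, the sort from the pseudocode can be skipped here:

```python
import pandas as pd

def time_lapse(repository: str, dataset: pd.DataFrame) -> pd.DataFrame:
    """Add the time elapsed since the repository's first commit to each row."""
    mask = dataset["repository"] == repository
    dates = pd.to_datetime(dataset.loc[mask, "committed_date"])
    start_date = dates.min()  # the earliest commit acts as the reference point
    dataset.loc[mask, "time_lapse"] = dates - start_date
    return dataset
```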
```
function status_feature(df):
    for each unique_commit in df['commit_hash']:
        Create df_temp containing rows where 'commit_hash' equals unique_commit
        next_commit = rows in df where 'commit_hash' equals the 'next_commit_hash' of the first row of df_temp
        common_messages = messages common to both df_temp and next_commit
        Set 'status' to "closed" for rows in df where 'commit_hash' equals unique_commit
        Set 'status' to "open" for rows in df where 'commit_hash' equals unique_commit and 'message' is in common_messages
    end for
    return df
```
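One possible pandas implementation of `status_feature`, assuming each row carries the `next_commit_hash` column referenced in the pseudocode:

```python
import pandas as pd

def status_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Mark an issue 'open' if it persists into the next commit, else 'closed'."""
    for commit in df["commit_hash"].unique():
        current = df[df["commit_hash"] == commit]
        next_hash = current.iloc[0]["next_commit_hash"]
        following = df[df["commit_hash"] == next_hash]
        # Messages present in both commits are still unresolved.
        common = set(current["message"]) & set(following["message"])
        df.loc[df["commit_hash"] == commit, "status"] = "closed"
        df.loc[(df["commit_hash"] == commit) & df["message"].isin(common), "status"] = "open"
    return df
```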
```
function generate_workflow_id(dataset):
    Sort dataset by 'committed_date'
    mapping = {}
    result = {}
    for each row in dataset:
        repo = row['repository']
        filepath = row['file_path']
        hash_val = row['commit_hash']
        previous = row['previous_file_path']
        author = row['author_name']
        if filepath is not empty and previous is NaN then
            // newly added file: mint a fresh id
            unique_id = "{author}/{repo}/{filepath}/{hash_val}"
            mapping[(repo, filepath)] = unique_id
            result[index of row] = unique_id
        else if filepath is empty and previous is not empty then
            // deleted file: retire its mapping
            if (repo, previous) not in mapping then
                raise ValueError
            else:
                delete mapping[(repo, previous)]
                result[index of row] = null
        else if filepath is not empty and previous is not empty then
            // renamed file: carry the original id forward
            unique_id = "{author}/{repo}/{filepath}/{hash_val}"
            if (repo, previous) not in mapping then
                print "cannot find unique id for previous filepath: {repo}/{previous}"
                result[index of row] = unique_id
            else:
                mapping[(repo, filepath)] = mapping[(repo, previous)]
                result[index of row] = mapping[(repo, filepath)]
    end for
    return result
```
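A Python sketch of `generate_workflow_id`; the `isinstance(previous, float)` test mirrors the pseudocode's type check, since a missing `previous_file_path` surfaces as a float NaN in pandas:

```python
import pandas as pd

def generate_workflow_id(dataset: pd.DataFrame) -> dict:
    """Assign a stable ID to each workflow file, following it across renames."""
    dataset = dataset.sort_values("committed_date")
    mapping = {}  # (repository, file_path) -> unique workflow id
    result = {}   # row index -> unique workflow id (None for deletions)
    for idx, row in dataset.iterrows():
        repo = row["repository"]
        filepath = row["file_path"]
        hash_val = row["commit_hash"]
        previous = row["previous_file_path"]
        author = row["author_name"]
        if filepath and isinstance(previous, float):
            # No previous path (NaN): the file was newly added.
            unique_id = f"{author}/{repo}/{filepath}/{hash_val}"
            mapping[(repo, filepath)] = unique_id
            result[idx] = unique_id
        elif not filepath and previous:
            # The file was deleted: retire its mapping.
            if (repo, previous) not in mapping:
                raise ValueError(f"no id recorded for {repo}/{previous}")
            del mapping[(repo, previous)]
            result[idx] = None
        elif filepath and previous:
            # The file was renamed: carry the original id forward.
            unique_id = f"{author}/{repo}/{filepath}/{hash_val}"
            if (repo, previous) not in mapping:
                print(f"cannot find unique id for previous filepath: {repo}/{previous}")
                result[idx] = unique_id
            else:
                mapping[(repo, filepath)] = mapping[(repo, previous)]
                result[idx] = mapping[(repo, filepath)]
    return result
```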
```
function resampling(dataset, repository):
    Filter dataset where 'repository' equals repository
    if length of dataset < 2 then
        return dataset
    error_count_sum = sum of 'error_count' grouped by 'date' in dataset
    Sort dataset by 'committed_date'
    temp_list = []
    for each row in dataset:
        date = row['date']
        next_date = row['next_date']
        number_of_days = difference in days between next_date and date
        error_sum = error count sum for date, taken from error_count_sum
        if date equals next_date then
            continue to the next iteration
        else if number_of_days > 1 then
            Insert a duplicate row for each expected date in the gap, carrying error_sum
            Append the inserted rows to temp_list
    end for
    if temp_list is empty then
        return dataset
    else:
        Concatenate temp_list and dataset into df_data
        Sort df_data by 'committed_date'
        return df_data
```
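Finally, a sketch of `resampling`, under the assumption that `date` and `next_date` are per-row datetime columns and that a gap is filled by duplicating the row once per missing day:

```python
import pandas as pd

def resampling(dataset: pd.DataFrame, repository: str) -> pd.DataFrame:
    """Fill multi-day gaps between commits with carried-forward duplicate rows."""
    dataset = dataset[dataset["repository"] == repository]
    if len(dataset) < 2:
        return dataset
    error_count_sum = dataset.groupby("date")["error_count"].sum()
    dataset = dataset.sort_values("committed_date")
    temp_list = []
    for _, row in dataset.iterrows():
        date, next_date = row["date"], row["next_date"]
        if pd.isna(next_date) or date == next_date:
            continue
        number_of_days = (next_date - date).days
        if number_of_days > 1:
            for offset in range(1, number_of_days):
                # Duplicate the row for each expected date inside the gap.
                filler = row.copy()
                filler["date"] = date + pd.Timedelta(days=offset)
                filler["error_count"] = error_count_sum.get(date, row["error_count"])
                temp_list.append(filler)
    if not temp_list:
        return dataset
    df_data = pd.concat([dataset, pd.DataFrame(temp_list)])
    return df_data.sort_values("committed_date")
```

The filler rows carry the day's summed error count forward, so days without commits still contribute to the normalized series.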