# Data Analysis

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

In [2]:
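def mine_frequent_patterns(
    transactions: list[list], support: float
) -> pd.DataFrame:
    """Mine frequent itemsets with Apriori.

    Each transaction is a list of items; `support` is the minimum fraction
    of transactions in which an itemset must appear to be reported.
    """
    # One-hot encode the transactions: one boolean column per distinct item.
    te = TransactionEncoder()
    encoding = te.fit(transactions).transform(transactions)
    encoding_df = pd.DataFrame(encoding, columns=te.columns_)
    frequent_itemsets = apriori(encoding_df, min_support=support, use_colnames=True)
    # Record the size of each itemset so that k-item sets can be filtered later.
    frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(len)
    return frequent_itemsets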

In [3]:
workflows_df = pd.read_pickle("../dumps/workflows_df.pkl")
actions_df = pd.read_pickle("../dumps/actions_df.pkl")
frequent_actions_df = pd.read_pickle("../dumps/frequent_actions_df.pkl")
frequent_actions_noTags_df = pd.read_pickle("../dumps/frequent_actions_noTags_df.pkl")
frequent_docker_commands_subsample_df = pd.read_pickle("../dumps/frequent_docker_commands_subsample_df.pkl")

## Descriptive statistics

### Repo Filtering

- Original [RepoReapers dataset](https://reporeapers.github.io/results/1.html): **1,853,195**
- Repositories classified as containing an _engineered software project_: **446,511** (not available: **38,742**)
- Repositories with _DS-related keywords_ in topics or description: **2,516**
- Repositories with at least one workflow: **155**

### Workflows

- Total number of workflows found: **399**
- Valid workflows (valid YAML file): **397**
- Invalid workflows (invalid YAML file): **2**

First, let's check that all 155 repositories and all 397 valid workflows are present in the loaded data.

In [4]:
workflows_df["repository"].unique().shape

(155,)

In [5]:
workflows_df["path"] = workflows_df["repository"] + "/" + workflows_df["filename"]
workflows_df["path"].unique().shape

(397,)

#### Number of workflows per repository

In [6]:
workflows_df.groupby("repository").count()["path"].describe()

count    155.000000
mean       2.561290
std        2.401525
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max       14.000000
Name: path, dtype: float64

#### Most common events triggering workflows

In [9]:
transactions = workflows_df['trigger_events'].tolist()
mine_frequent_patterns(transactions, support=0.05)

| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.612091 | (pull_request) | 1 |
| 1 | 0.670025 | (push) | 1 |
| 2 | 0.057935 | (release) | 1 |
| 3 | 0.13602 | (schedule) | 1 |
| 4 | 0.13602 | (workflow_dispatch) | 1 |
| 5 | 0.493703 | (pull_request, push) | 2 |
| 6 | 0.060453 | (pull_request, schedule) | 2 |
| 7 | 0.078086 | (pull_request, workflow_dispatch) | 2 |
| 8 | 0.057935 | (push, workflow_dispatch) | 2 |
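
To make the `support` column concrete: the support of an itemset is the fraction of transactions (here, workflows) that contain every item in the set. For example, a support of 0.493703 for `(pull_request, push)` corresponds to 196 of the 397 workflows. The sketch below recomputes support by brute force on a small, made-up list of trigger events. The function name `itemset_support` and the toy data are illustrative assumptions, not part of the notebook, and the exhaustive enumeration is a stand-in for Apriori that is only practical for small inputs.

```python
from itertools import combinations


def itemset_support(transactions, max_len=2):
    """Return {itemset: support} for every itemset of up to `max_len` items.

    Hypothetical helper for illustration; it enumerates all combinations
    instead of pruning candidates the way Apriori does.
    """
    n = len(transactions)
    counts = {}
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {itemset: count / n for itemset, count in counts.items()}


# Made-up trigger events for four workflows (not taken from the dataset).
toy_transactions = [
    ["push", "pull_request"],
    ["push"],
    ["pull_request", "push"],
    ["schedule"],
]

supports = itemset_support(toy_transactions)
for itemset, value in sorted(supports.items(), key=lambda kv: -kv[1]):
    print(itemset, value)
```

Passing `support=0.05` to `mine_frequent_patterns` keeps only the itemsets whose value computed this way is at least 0.05.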


## RQ1 - "Is GitHub Actions used to automate project deployment?"

- Ratio of workflows with actions or run commands related to Docker
- Ratio of workflows that upload a container image to Docker Hub or to GitHub Packages
- Ratio of workflows that upload any type of software package to GitHub Packages
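
As a minimal sketch of how the first ratio might be computed, assuming the boolean columns `docker_related_actions` and `docker_related_commands` of `workflows_df` flag Docker usage per workflow (the frame below is made-up toy data, and `toy_workflows_df` is a hypothetical name):

```python
import pandas as pd

# Made-up stand-in for workflows_df; only the column names follow the notebook.
toy_workflows_df = pd.DataFrame(
    {
        "path": ["a/x.yml", "a/y.yml", "b/z.yml", "c/w.yml"],
        "docker_related_actions": [True, False, False, False],
        "docker_related_commands": [False, False, True, False],
    }
)

# A workflow counts as Docker-related if either flag is set.
docker_related = (
    toy_workflows_df["docker_related_actions"]
    | toy_workflows_df["docker_related_commands"]
)

# The mean of a boolean Series is the fraction of True values.
docker_ratio = docker_related.mean()
print(f"Docker-related workflows: {docker_ratio:.2%}")
```

Whether the two flags should be combined with "or", or reported separately, is a design choice that depends on how the ratio is meant to be interpreted.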

#### Workflows with keyword `deploy` in name or filename

#### Actions with keyword `deploy` in the slug

- Number of distinct actions used in the dataset
- Number of workflows using at least one of such actions

#### Run commands containing keyword `deploy`

- Number of workflows containing at least one of such commands

#### Workflows with keyword `publish` in name or filename

#### Actions with keyword `publish` in the slug

- Number of distinct actions used in the dataset
- Number of workflows using at least one of such actions

#### Run commands containing keyword `publish`

- Number of workflows containing at least one of such commands

#### Workflows with keyword `docker` in name or filename

In [23]:
temp_df = workflows_df.loc[
    workflows_df["filename"].str.contains("docker") | workflows_df["name"].str.contains("docker")
]
print("Size:", temp_df.shape[0])
temp_df

Size: 7


| | repository | filename | name | trigger_events | n_of_actions | docker_related_actions | n_of_run_commands | docker_related_commands | docker_commands | path |
|---|---|---|---|---|---|---|---|---|---|---|
| 295 | insideout10/wordlift-plugin | qa.deplyment.yml | docker_build_and_k8s_deployment | [push, workflow_dispatch] | 9 | True | 0 | False | [] | insideout10/wordlift-plugin/qa.deplyment.yml |
| 297 | weecology/retriever | docker-publish.yml | Docker | [push, pull_request] | 6 | True | 5 | False | [] | weecology/retriever/docker-publish.yml |
| 323 | tiborsimko/invenio-data | docker-build-server.yml | Build CAP server image | [push] | 1 | False | 1 | False | [] | tiborsimko/invenio-data/docker-build-server.yml |
| 324 | tiborsimko/invenio-data | docker-build-ui.yml | Build CAP UI image | [push] | 1 | False | 1 | False | [] | tiborsimko/invenio-data/docker-build-ui.yml |
| 328 | tiborsimko/data.cern.ch | docker-build-server.yml | Build CAP server image | [push] | 1 | False | 1 | False | [] | tiborsimko/data.cern.ch/docker-build-server.yml |
| 329 | tiborsimko/data.cern.ch | docker-build-ui.yml | Build CAP UI image | [push] | 1 | False | 1 | False | [] | tiborsimko/data.cern.ch/docker-build-ui.yml |
| 383 | linkedin/gobblin | docker_build_publish.yaml | Build and Publish Docker image | [push, pull_request, release] | 4 | True | 1 | False | [] | linkedin/gobblin/docker_build_publish.yaml |


#### Actions with keyword `docker` in the slug

##### Number of distinct actions used in the dataset

Considering tags

In [43]:
actions_df.loc[actions_df.loc[:,"action_slug"].str.contains("docker"), "action_slug"].unique().shape

(14,)

Not considering tags

In [44]:
actions_df.loc[actions_df.loc[:,"action_slug"].str.contains("docker"), "action_slug_noTag"].unique().shape

(10,)
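
The gap between the two counts arises because the same action pinned to different versions yields different slugs. The sketch below illustrates this on made-up data, assuming that `action_slug_noTag` is the slug with everything from the `@` onward removed; how the column was actually built is not shown in this notebook, and `toy_actions_df` is a hypothetical name.

```python
import pandas as pd

# Made-up slugs for illustration; not taken from the dataset.
toy_actions_df = pd.DataFrame(
    {
        "action_slug": [
            "docker/build-push-action@v1",
            "docker/build-push-action@v2",
            "docker/login-action@v1",
            "actions/checkout@v2",
        ]
    }
)

# Assumption: the tag is whatever follows "@" in the slug.
toy_actions_df["action_slug_noTag"] = (
    toy_actions_df["action_slug"].str.split("@").str[0]
)

is_docker = toy_actions_df["action_slug"].str.contains("docker")
n_with_tags = toy_actions_df.loc[is_docker, "action_slug"].nunique()
n_without_tags = toy_actions_df.loc[is_docker, "action_slug_noTag"].nunique()
print(n_with_tags, n_without_tags)
```

Here the two versions of `docker/build-push-action` count twice with tags and once without.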

##### Number of workflows using at least one of such actions

In [48]:
actions_df.loc[actions_df.loc[:,"action_slug"].str.contains("docker"), "workflow"].unique().shape

(14,)

#### Run commands containing keyword `docker`

- Number of workflows containing at least one of such commands

#### Most common docker commands

#### Workflows with keyword `package` in name or filename

#### Actions with keyword `package` in the slug

- Number of distinct actions used in the dataset
- Number of workflows using at least one of such actions

#### Run commands containing keyword `package`

- Number of workflows containing at least one of such commands

## RQ2 - "What are the most frequently used Actions?"

#### Total number of actions used in the dataset

#### Average number of actions per workflow

#### Actions / Run Commands

#### Number of actions available on the GitHub Marketplace

Ratio of actions available on the Marketplace to custom actions not registered on it

#### Number of actions by verified creators

#### Distribution of action categories

#### 10 most popular actions
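
One possible way to rank actions, sketched on made-up data: count the number of distinct workflows that use each action (ignoring tags), so that an action invoked several times in one workflow is counted once. The column names `workflow` and `action_slug_noTag` follow `actions_df`; the toy frame and the choice of "popularity" metric are assumptions.

```python
import pandas as pd

# Made-up stand-in for actions_df: one row per action usage.
toy_actions_df = pd.DataFrame(
    {
        "workflow": [
            "r1/ci.yml", "r1/ci.yml", "r1/ci.yml",
            "r2/build.yml", "r2/build.yml",
        ],
        "action_slug_noTag": [
            "actions/checkout", "actions/checkout", "actions/setup-python",
            "actions/checkout", "actions/cache",
        ],
    }
)

# Keep one row per (workflow, action) pair, then count workflows per action.
top_actions = (
    toy_actions_df.drop_duplicates(["workflow", "action_slug_noTag"])[
        "action_slug_noTag"
    ]
    .value_counts()
    .head(10)
)
print(top_actions)
```

Dropping the `drop_duplicates` step would instead rank actions by raw number of invocations.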

#### 10 most popular actions available on the GitHub Marketplace

## RQ3 - "What are the sets of actions that typically co-occur in workflows?"
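
Since `mine_frequent_patterns` stores the size of each itemset in a `length` column, the 2-item and 3-item sets can be selected by filtering on that column. The following is a minimal sketch on a made-up result frame shaped like the function's output (itemsets as `frozenset` objects); the helper name `k_itemsets` and the data are hypothetical.

```python
import pandas as pd

# Made-up frame mimicking the output of mine_frequent_patterns.
toy_itemsets_df = pd.DataFrame(
    {
        "support": [0.9, 0.6, 0.5, 0.3],
        "itemsets": [
            frozenset({"actions/checkout"}),
            frozenset({"actions/setup-python"}),
            frozenset({"actions/checkout", "actions/setup-python"}),
            frozenset(
                {"actions/checkout", "actions/setup-python", "actions/cache"}
            ),
        ],
    }
)
toy_itemsets_df["length"] = toy_itemsets_df["itemsets"].apply(len)


def k_itemsets(itemsets_df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Return the itemsets with exactly k items, most frequent first."""
    return (
        itemsets_df.loc[itemsets_df["length"] == k]
        .sort_values("support", ascending=False)
        .reset_index(drop=True)
    )


pairs = k_itemsets(toy_itemsets_df, 2)
triples = k_itemsets(toy_itemsets_df, 3)
print(pairs)
print(triples)
```

The same filter applies whether the transactions were built from tagged or untagged slugs.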

#### Frequent 2-item sets of Actions

##### With tags

##### Without tags

#### Frequent 3-item sets of Actions

##### With tags

##### Without tags