# GitHub Data from `thoth-station` organization

In this notebook, we access GitHub data from the `thoth-station` organization. We collect data from all of its repos so that we can later filter them and use the data to train the TTM ML model.

The motivation is to identify the repos that contain contributions from non-thoth members, so that we can use the PR data from those repos to train our model.

In [80]:
import os
import time
import pandas as pd

from github import Github, RateLimitExceededException
from tqdm import tqdm
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

True

In [81]:
access_token = os.getenv("ACCESS_TOKEN")
g = Github(access_token)

Checking the rate limit is important. Authenticated requests are limited to 5000 per hour. Once the limit is exhausted, you need to wait until it resets (up to an hour) before extracting more data. `g.rate_limiting` returns a tuple of (requests remaining, request limit).

In [146]:
g.rate_limiting

(330, 5000)
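
To make the tuple above concrete: `(330, 5000)` means 330 requests are left in the current window out of 5000. The following is a minimal sketch of how one might decide when to pause. The helper names `needs_pause` and `seconds_until_reset` and the threshold of 50 requests are assumptions for illustration, not part of the notebook. In a live session the reset time could be taken from `g.rate_limiting_resettime` (a Unix timestamp); fixed numbers are used here so the block runs offline.

```python
import time


def needs_pause(rate_limiting, min_remaining=50):
    """Return True when fewer than min_remaining requests are left.

    rate_limiting is a (remaining, limit) tuple such as (330, 5000).
    """
    remaining, _limit = rate_limiting
    return remaining < min_remaining


def seconds_until_reset(reset_epoch, now_epoch=None):
    """Seconds to wait until the rate limit window resets (never negative)."""
    if now_epoch is None:
        now_epoch = time.time()
    return max(0, reset_epoch - now_epoch)


# The tuple printed above: 330 of 5000 requests remaining.
print(needs_pause((330, 5000)))
# Hypothetical timestamps: the window resets 600 seconds from "now".
print(seconds_until_reset(reset_epoch=1_000_600, now_epoch=1_000_000))
```

Waiting only until the reset time, rather than a fixed interval, avoids sleeping longer than necessary.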

First, we extract surface-level information for each repo in the thoth-station organization, such as the number of forks, stars, PRs, and issues.

In [None]:
project_list = g.get_organization("thoth-station").get_repos()

rows = []
for repo in tqdm(project_list):
    # print(f"Extracting data from {repo}")

    # Pull requests targeting master, both open and closed
    PRs = repo.get_pulls(state="all", base="master")

    # Note: GitHub's issues endpoint also returns pull requests,
    # so this count includes PRs as well as plain issues
    all_issues = repo.get_issues(state="all")

    rows.append(
        {
            "Full_name": repo.full_name,
            "Forks": repo.forks_count,
            "Stars": repo.stargazers_count,
            "last_updated": repo.updated_at,
            "PR_count": PRs.totalCount,
            "Issue_count": all_issues.totalCount,
        }
    )

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df_github = pd.DataFrame(rows)

In [83]:
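# df_github.to_csv("df_github.csv")  # saving it for future use
# Load the saved copy; parse last_updated as datetimes so the date filter further down compares dates, not strings
df = pd.read_csv("df_github.csv", index_col=0, parse_dates=["last_updated"])
# df = df_github  # use this instead if the extraction cell above was run in this session
df.head()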

| | Forks | Full_name | Issue_count | PR_count | Stars | last_updated |
|---|---|---|---|---|---|---|
| 0 | 10.0 | thoth-station/package-extract | 484.0 | 402.0 | 1.0 | 2022-01-10 20:24:39 |
| 1 | 22.0 | thoth-station/core | 404.0 | 276.0 | 26.0 | 2022-04-30 11:18:51 |
| 2 | 6.0 | thoth-station/result-api | 424.0 | 359.0 | 0.0 | 2020-08-05 13:59:11 |
| 3 | 18.0 | thoth-station/user-api | 1761.0 | 1299.0 | 7.0 | 2022-05-16 07:55:53 |
| 4 | 6.0 | thoth-station/cleanup-job | 306.0 | 226.0 | 1.0 | 2021-07-21 11:33:25 |


In [84]:
print(
    f"The number of repos in the thoth-station organization is {df['Full_name'].nunique()}"
)

The number of repos in the thoth-station organization is 179


In the next step, we loop over each repo to extract information about its individual issues.

In [None]:
repos = list(df["Full_name"])
# repos = ['thoth-station/package-analyzer']
g = Github(access_token, retry=3, timeout=5)

rows = []
for repo_name in tqdm(repos):
    while True:
        # Collected per repo and reset on every retry, so a repo that is
        # restarted after a rate-limit pause does not add duplicate rows
        repo_rows = []
        try:
            # print(f"Extracting data from {repo_name}")
            # Keep the name and the Repository object in separate variables,
            # so a retry calls get_repo with the name again
            repo = g.get_repo(repo_name)

            for issue in repo.get_issues(state="all"):
                # The issues endpoint also returns pull requests; skip them
                if issue.pull_request is not None:
                    continue
                repo_rows.append(
                    {
                        "Project_ID": repo.id,
                        "Name": repo.name,
                        "Full_name": repo.full_name,
                        "issue_number": issue.number,
                        "owner": issue.user.name,
                        "owner_username": issue.user.login,
                    }
                )
        except RateLimitExceededException as e:
            print(e.status)
            print("Rate limit exceeded")
            print(g.rate_limiting)
            time.sleep(300)
            continue
        rows.extend(repo_rows)
        break

# DataFrame.append was removed in pandas 2.0, so build the frame once at the end
df_github3 = pd.DataFrame(rows)
# df_github3.to_csv('df_github3.csv')
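
The wait-and-retry pattern used in the loop above can also be factored into a small helper. This is a sketch under stated assumptions, not the notebook's method: `RateLimitError` is a hypothetical stand-in for `RateLimitExceededException`, `call_with_retry` and `flaky_fetch` are illustrative names, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for github.RateLimitExceededException."""


def call_with_retry(fn, retry_on=RateLimitError, wait_seconds=300,
                    max_attempts=5, sleep=time.sleep):
    """Call fn(); on retry_on, wait and try again, up to max_attempts calls."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            sleep(wait_seconds)


# Usage with a fake call that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("rate limit exceeded")
    return ["issue-1", "issue-2"]


waits = []  # record the requested waits instead of sleeping
result = call_with_retry(flaky_fetch, wait_seconds=300, sleep=waits.append)
print(result, waits)
```

Capping the number of attempts is a design choice: an unbounded `while True` keeps retrying even when the failure is not going to clear.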

In [85]:
df_repo = pd.read_csv("df_github3.csv", index_col=0)

In [86]:
df_repo.head(20)

| | Full_name | Name | Project_ID | issue_number | owner | owner_username |
|---|---|---|---|---|---|---|
| 0 | thoth-station/package-extract | package-extract | 117243068.0 | 482.0 | Maya Costantini | mayaCostantini |
| 1 | thoth-station/package-extract | package-extract | 117243068.0 | 480.0 | Francesco Murdaca | pacospace |
| 2 | thoth-station/package-extract | package-extract | 117243068.0 | 478.0 | Thoth Bot | sesheta |
| 3 | thoth-station/package-extract | package-extract | 117243068.0 | 475.0 | Thoth Bot | sesheta |
| 4 | thoth-station/package-extract | package-extract | 117243068.0 | 473.0 | Harshad Reddy Nalla | harshad16 |
| 5 | thoth-station/package-extract | package-extract | 117243068.0 | 471.0 | Harshad Reddy Nalla | harshad16 |
| 6 | thoth-station/package-extract | package-extract | 117243068.0 | 469.0 | Fridolín Pokorný | fridex |
| 7 | thoth-station/package-extract | package-extract | 117243068.0 | 461.0 | Fridolín Pokorný | fridex |
| 8 | thoth-station/package-extract | package-extract | 117243068.0 | 457.0 | | khebhut[bot] |
| 9 | thoth-station/package-extract | package-extract | 117243068.0 | 456.0 | Fridolín Pokorný | fridex |


In [87]:
print(f"Number of repos : {df_repo['Full_name'].nunique()}.")

Number of repos : 144.


In [88]:
n_all = df["Full_name"].nunique()
n_with_issues = df_repo["Full_name"].nunique()
print(
    f"The original list has {n_all} repos, but only {n_with_issues} remain after extracting the issue information. "
    f"The remaining {n_all - n_with_issues} repos have no issues opened. "
    f"Hence, in the next step we exclude the repos with no issues opened."
)

The original list has 179 repos, but only 144 remain after extracting the issue information. The remaining 35 repos have no issues opened. Hence, in the next step we exclude the repos with no issues opened.


**List of repos with no issues**

In [89]:
# Keep the repos whose name never appears in the issue-level data
df_repo_with_no_issue = df[~df["Full_name"].isin(df_repo["Full_name"])]

In [90]:
len(df_repo_with_no_issue["Full_name"].unique())

35

Next, we filter the repos based on:

- Number of stars for each repo.
- Repos that have been active since last year.
- Contributions from non-thoth members.

### Filtering based on Stars

In [92]:
df.describe()

| | Forks | Issue_count | PR_count | Stars |
|---|---|---|---|---|
| count | 179.0 | 179.0 | 179.0 | 179.0 |
| mean | 5.547486 | 339.458101 | 264.212291 | 3.089385 |
| std | 4.580336 | 1820.874735 | 1765.676513 | 12.83792 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 6.5 | 3.5 | 0.0 |
| 50% | 5.0 | 38.0 | 23.0 | 1.0 |
| 75% | 8.0 | 192.0 | 128.0 | 2.0 |
| max | 29.0 | 23487.0 | 23432.0 | 164.0 |


We keep the repos with `stars > 2`, that is, above the 75th percentile shown in the summary above. The filter is applied to all 179 repos from the thoth-station organization.

In [99]:
df1 = df[(df["Stars"] > 2)]

In [100]:
print(
    f"After filtering, the number of repos with more than 2 stars is {df1['Full_name'].nunique()}."
)

After filtering, the number of repos with more than 2 stars is 38.


### Filtering based on activity

We keep only the repos that have been active since last year, that is, updated after 1 May 2021. Here too, we apply the filter to all 179 repos.

In [101]:
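# Compare as datetimes: a string column (e.g. loaded from CSV without date parsing)
# would otherwise be compared character by character
df2 = df[pd.to_datetime(df["last_updated"]) > pd.Timestamp("2021-05-01")]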

In [103]:
df2.head()

| | Forks | Full_name | Issue_count | PR_count | Stars | last_updated |
|---|---|---|---|---|---|---|
| 0 | 10.0 | thoth-station/package-extract | 484.0 | 402.0 | 1.0 | 2022-01-10 20:24:39 |
| 1 | 22.0 | thoth-station/core | 404.0 | 276.0 | 26.0 | 2022-04-30 11:18:51 |
| 3 | 18.0 | thoth-station/user-api | 1761.0 | 1299.0 | 7.0 | 2022-05-16 07:55:53 |
| 4 | 6.0 | thoth-station/cleanup-job | 306.0 | 226.0 | 1.0 | 2021-07-21 11:33:25 |
| 5 | 11.0 | thoth-station/solver | 5176.0 | 700.0 | 14.0 | 2022-05-03 23:15:28 |


In [104]:
print(
    f"The number of repos that have been active since last year is {df2['Full_name'].nunique()}."
)

The number of repos that have been active since last year is 122.
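
One detail worth checking in the activity filter: the comparison only behaves as a date comparison when `last_updated` holds datetimes. If the column holds plain strings, as it does after `pd.read_csv` without date parsing, the values are compared character by character. A minimal sketch in plain Python, using three timestamps from the table at the top of the notebook (the variable names are illustrative):

```python
from datetime import datetime

timestamps = [
    "2022-01-10 20:24:39",
    "2021-07-21 11:33:25",
    "2020-08-05 13:59:11",
]
cutoff = datetime(2021, 5, 1)

# String comparison is lexicographic: "2021-0..." sorts before "2021-5...",
# so the July 2021 timestamp is wrongly dropped.
kept_as_strings = [t for t in timestamps if t > "2021-5-1"]

# Parsing first compares actual points in time.
kept_as_dates = [t for t in timestamps if datetime.fromisoformat(t) > cutoff]

print(kept_as_strings)
print(kept_as_dates)
```

The July 2021 entry survives only in the second list, which is why parsing the column before filtering is the safer choice.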


### Filtering based on contribution from non-thoth account

We have a list of thoth members that includes both present and past members:

In [137]:
thoth_members = [
    "codificat",
    "erikerlandson",
    "fridex",
    "Gkrumbach07",
    "goern",
    "Gregory-Pereira",
    "harshad16",
    "HumairAK",
    "KPostOffice",
    "mayaCostantini",
    "meile18",
    "oindrillac",
    "pacospace",
    "schwesig",
    "sesheta",
    "tumido",
    "xtuchyna",
    "bot",  # substring that matches bot accounts such as khebhut[bot]
    "sub-mod",
    "GiorgosKarantonis",
    "CermakM",
    "bjoernh2000",
    "srushtikotak",
    "4n4nd",
    "EldritchJS",
    "sesheta-srcops",
    "Shreyanand",
    "bissenbay",
    "saisankargochhayat",
]

# Joined into one regex alternation; str.contains matches it anywhere in the username
strings_to_exclude = "|".join(thoth_members)

In [138]:
df_repo.head(2)

| | Full_name | Name | Project_ID | issue_number | owner | owner_username |
|---|---|---|---|---|---|---|
| 0 | thoth-station/package-extract | package-extract | 117243068.0 | 482.0 | Maya Costantini | mayaCostantini |
| 1 | thoth-station/package-extract | package-extract | 117243068.0 | 480.0 | Francesco Murdaca | pacospace |


In [139]:
df3 = df_repo[~df_repo["owner_username"].str.contains(strings_to_exclude)]
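
Because `str.contains` treats the joined string as a regular expression and matches it anywhere in the username, the filter above is a substring match, not an exact match. That is what lets the `"bot"` entry catch accounts like `khebhut[bot]`, but it would also drop any outside contributor whose username happens to contain one of the listed strings. A small sketch with plain `re`, which applies the same search semantics. The shortened member list is for illustration only, and `robotron` is a made-up username:

```python
import re

members = ["goern", "fridex", "sesheta", "bot"]
pattern = "|".join(members)
member_set = set(members)

usernames = [
    "fridex",
    "khebhut[bot]",
    "sesheta-srcops",
    "sschuberth",
    "robotron",  # hypothetical outside contributor
]

# Substring (regex search) semantics, as in the filter above
kept_by_substring = [u for u in usernames if not re.search(pattern, u)]

# Exact-match semantics, for comparison
kept_by_exact = [u for u in usernames if u not in member_set]

print(kept_by_substring)
print(kept_by_exact)
```

Neither variant is strictly better here: the substring match removes bots without listing each one, while the exact match avoids dropping look-alike usernames. If any listed name contained regex metacharacters, wrapping each entry in `re.escape` before joining would keep it literal.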

In [140]:
df3.head(20)

| | Full_name | Name | Project_ID | issue_number | owner | owner_username |
|---|---|---|---|---|---|---|
| 819 | thoth-station/solver | solver | 120300624.0 | 5166.0 | Sebastian Schuberth | sschuberth |
| 5263 | thoth-station/solver | solver | 120300624.0 | 69.0 | Akash Parekh | ace2107 |
| 6074 | thoth-station/dependency-monkey | dependency-monkey | 122620723.0 | 18.0 | Sarah Masud | sara-02 |
| 6587 | thoth-station/notebooks | notebooks | 123426453.0 | 44.0 | | shruthi-raghuraman |
| 6714 | thoth-station/adviser | adviser | 123548968.0 | 1945.0 | Isabel Zimmerman | isabelizimm |
| 7503 | thoth-station/kebechet | kebechet | 134576983.0 | 860.0 | | JamesKunstle |
| 7520 | thoth-station/kebechet | kebechet | 134576983.0 | 802.0 | | qJJt7oxN |
| 7565 | thoth-station/kebechet | kebechet | 134576983.0 | 702.0 | Landon LaSmith | LaVLaS |
| 7566 | thoth-station/kebechet | kebechet | 134576983.0 | 701.0 | Landon LaSmith | LaVLaS |
| 7749 | thoth-station/kebechet | kebechet | 134576983.0 | 189.0 | Tomas Tomecek | TomasTomecek |


In [141]:
print(f"The names of non-thoth members are : {df3['owner'].unique()}.")

The names of non-thoth members are : ['Sebastian Schuberth' 'Akash Parekh' 'Sarah Masud' nan 'Isabel Zimmerman'
 'Landon LaSmith' 'Tomas Tomecek' 'Vaclav Pavlin' 'Anish Asthana'
 'Sorin Sbarnea' 'Tomas' 'John Vandenberg' 'Alan Chin' 'Ryan Kraus'
 'Chad Roberts' 'Martin D. Jaere' 'Hema Veeradhi' 'Dowon' 'Matt Carr'
 "Lumír 'Frenzy' Balhar" 'Aurélien Bompard' 'Humberto Anjos'
 'Deleted user' 'Vadim Bauer' 'Willy Hardy' 'Miro Hrončok'
 'Michael Clifford' 'Rishabh Aggarwal' 'Karanraj Chauhan'
 'Surya Prakash Pathak' 'Ariel Shulman' 'Anup Kumar' 'Sophie Watson'
 'Guillaume Moutier'].


After excluding the rows for contributions from thoth members and bots, we are left with a dataset (`df3`) that lists the contributions from non-thoth accounts.

In [142]:
print(
    f"The number of repos with contributions from non-thoth accounts is {df3['Full_name'].nunique()}."
)

The number of repos with contributions from non-thoth accounts is 28.


### Union of three filters

Lastly, now that we have the three filtered lists of repos, we take the union of all three sets to get a final list of repos that may be significant for further analysis.

In [143]:
# Note: despite its name, df_union is a set of repo names, not a DataFrame
df_union = set(df1["Full_name"]).union(df2["Full_name"], df3["Full_name"])

In [144]:
print(f"The number of repos that we get is {len(df_union)}")

The number of repos that we get is 128
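
Taking the union means a repo is kept if it passes any one of the three filters, which is why the result (128) is larger than each individual filtered list. An intersection would instead keep only repos that pass all three. A toy sketch of the difference, where the set contents are illustrative and not the actual filter results:

```python
# Illustrative memberships only; the real sets come from df1, df2 and df3.
starred = {"thoth-station/core", "thoth-station/solver"}
active = {"thoth-station/core", "thoth-station/solver", "thoth-station/cleanup-job"}
external = {"thoth-station/solver"}

# Union: passes at least one filter
any_filter = starred.union(active, external)

# Intersection: passes every filter
all_filters = starred.intersection(active, external)

print(sorted(any_filter))
print(sorted(all_filters))
```

The union is the more permissive choice and suits a first pass that aims to avoid discarding potentially useful repos.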


**The list of repos that we can consider for further analysis**

In [145]:
df_union

{'thoth-station/.github',
 'thoth-station/adviser',
 'thoth-station/aicoe-ci-pulp-upload-example',
 'thoth-station/amun-api',
 'thoth-station/amun-client',
 'thoth-station/amun-hwinfo',
 'thoth-station/analyzer',
 'thoth-station/ansible-role-argo-workflows',
 'thoth-station/buildlog-parser',
 'thoth-station/cleanup-job',
 'thoth-station/cli-examples',
 'thoth-station/common',
 'thoth-station/core',
 'thoth-station/cve-update-job',
 'thoth-station/datasets',
 'thoth-station/dependency-monkey',
 'thoth-station/dependency-monkey-zoo',
 'thoth-station/document-sync-job',
 'thoth-station/elyra-resnet',
 'thoth-station/fext',
 'thoth-station/glyph',
 'thoth-station/graph-backup-job',
 'thoth-station/graph-metrics-exporter',
 'thoth-station/graph-refresh-job',
 'thoth-station/graph-sync-job',
 'thoth-station/helm-charts',
 'thoth-station/help',
 'thoth-station/image-pusher',
 'thoth-station/init-job',
 'thoth-station/integration-tests',
 'thoth-station/invectio',
 'thoth-station/investigator'

# Conclusion

We listed the repos that may be significant for training the TTM model. In the next step, we will extract all the issue and PR data for the repos listed above.