# GitHub Issues Extraction
---

## Drive Mounting

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
current_folder = "/content/gdrive/My Drive/Workshop/TOM/"
os.chdir(current_folder)

In [None]:
repositories = os.path.join(current_folder, "repos")
# List the folder once and reuse the result.
list_of_repositories = os.listdir(repositories)
number_of_repositories = len(list_of_repositories)
print("Folder: {}".format(repositories))
print("There are {} repository files in this folder.".format(number_of_repositories))

Folder: /content/gdrive/My Drive/Workshop/TOM/repos
There are 40 repository files in this folder.


## Imports

In [None]:
!pip install PyGithub

Collecting PyGithub
  Downloading PyGithub-1.55-py3-none-any.whl (291 kB)
Collecting deprecated
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting pynacl>=1.4.0
  Downloading PyNaCl-1.4.0-cp35-abi3-manylinux1_x86_64.whl (961 kB)
Collecting pyjwt>=2.0
  Downloading PyJWT-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: pynacl, pyjwt, deprecated, PyGithub
Successfully installed PyGithub-1.55 deprecated-1.2.13 pyjwt-2.3.0 pynacl-1.4.0


In [None]:
import pandas as pd
import numpy as np
from github import Github
import pickle
import random

## [Repository](https://docs.google.com/spreadsheets/d/162RiTiA4xPYXUyyTwmVWp-4z88eSXAGmaocsptw8a9I/edit#gid=0) Extraction

In [None]:
# The file is only read, so plain "r" mode is enough.
with open("list_of_raw_repositories.txt", "r") as txt_file:
    list_of_raw_repos = txt_file.read().splitlines()


# random.choice avoids the IndexError that random.randint(0, len(...)) can raise,
# because randint includes its upper bound.
random_raw_repo = random.choice(list_of_raw_repos)
print("A random raw repository sample: \n\n {}".format(random_raw_repo))

In [None]:
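# Drop the first 19 characters (assumed to be the "https://github.com/" prefix)
# so that only "owner/name" remains.
list_of_repositories = [repo[19:] for repo in list_of_raw_repos]
# random.choice never indexes past the end, unlike random.randint(0, len(...)).
random_repo = random.choice(list_of_repositories)
print("A clean repository sample: \n\n {}".format(random_repo))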

A clean repository sample: 

 jnr/jnr-posix
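
The fixed slice `repo[19:]` only works when every line starts with exactly the same 19-character prefix. As a minimal sketch, assuming the raw entries are ordinary `https://github.com/owner/name` URLs (the raw file is not shown here), the name can instead be taken from the URL path. `repo_full_name` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse


def repo_full_name(url):
    """Return 'owner/name' from a GitHub repository URL (hypothetical helper)."""
    path = urlparse(url.strip()).path            # e.g. "/jnr/jnr-posix/"
    parts = [p for p in path.split("/") if p]    # drop empty path segments
    if len(parts) < 2:
        raise ValueError("not a repository URL: {!r}".format(url))
    name = parts[1]
    if name.endswith(".git"):                    # tolerate clone-style URLs
        name = name[:-4]
    return "{}/{}".format(parts[0], name)


print(repo_full_name("https://github.com/jnr/jnr-posix"))
```

This tolerates a trailing slash, an `http://` scheme, or a `.git` suffix, any of which would shift or pollute a fixed slice.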


In [None]:
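# Assumption: the notebook does not show how `g` was created. Here the client
# reads a personal access token from the GITHUB_TOKEN environment variable
# (unauthenticated if unset, with a much lower rate limit).
g = Github(os.environ.get("GITHUB_TOKEN"))

total = len(list_of_repositories)
print("We are working with {} repositories. \n\n".format(total))


value_counter = dict()
list_of_issues = list()

for index, repository in enumerate(list_of_repositories):
    repo = g.get_repo(repository)
    value_counter[str(repository)] = 0

    # Progress report every 20 repositories.
    if index % 20 == 0:
        print("{} of {}".format(index, total))

    # Note: the issues endpoint also returns open pull requests.
    for issue in repo.get_issues(state="open"):
        # _rawData is a private PyGithub attribute holding the JSON payload of the issue.
        raw = issue._rawData
        # Keep only scalar fields; nested dicts and lists are dropped.
        floating_dict = {key: value for key, value in raw.items() if not isinstance(value, (dict, list))}
        # Flatten the reaction counts into "reaction <name>" fields.
        floating_dict_reaction = {"reaction " + str(key): value for key, value in raw.get("reactions", {}).items()}
        floating_dict.update(floating_dict_reaction)
        list_of_issues.append(floating_dict)
        value_counter[str(repository)] += 1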

We are working with 120 repositories. 


0 of 120
20 of 120
40 of 120
60 of 120
80 of 120
100 of 120
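
The flattening step in the loop can be tried offline on a hand-written payload. This is a sketch only: `flatten_issue` is a hypothetical helper and `sample` is made-up data shaped like an issue payload, not real API output.

```python
def flatten_issue(raw):
    """Keep scalar fields and expand 'reactions' into prefixed keys."""
    flat = {k: v for k, v in raw.items() if not isinstance(v, (dict, list))}
    for key, value in raw.get("reactions", {}).items():
        flat["reaction " + key] = value
    return flat


# Hand-written sample, not real API output.
sample = {
    "number": 1,
    "title": "Example issue",
    "state": "open",
    "user": {"login": "someone"},     # nested dict: dropped
    "labels": [{"name": "bug"}],      # list: dropped
    "reactions": {"total_count": 2, "+1": 2, "-1": 0},
}

flat = flatten_issue(sample)
for key in sorted(flat):
    print(key, "=", flat[key])
```

Nested fields such as `user` and `labels` are lost by design here. If they are needed later, they would have to be flattened explicitly in the same way as `reactions`.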


## Extracted Issues

In [None]:
# Union of all keys seen across the extracted issues, in sorted order.
all_the_keys = sorted({element for wbuch in list_of_issues for element in wbuch.keys()})
print("We are working with {} features and {} rows. ".format(len(all_the_keys), len(list_of_issues)))

We are working with 34 features and 34268 rows. 


In [None]:
wbuchs = list()
for dictionary in list_of_issues:
    # Fill keys missing from this issue with NaN so every row has the same fields.
    wbuch = {key: dictionary.get(key, np.nan) for key in all_the_keys}
    wbuchs.append(wbuch)
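
As an aside, `pd.DataFrame` already fills keys that are missing from some dictionaries with `NaN`, so the manual alignment can be skipped when only the resulting frame matters. A minimal sketch with made-up records (the `records` list is illustrative, not taken from the extracted issues):

```python
import pandas as pd

# Two illustrative records with different key sets.
records = [
    {"title": "first", "draft": False},
    {"title": "second", "number": 7},
]

frame = pd.DataFrame(records)           # missing keys become NaN
frame = frame[sorted(frame.columns)]    # same column order as a sorted key list

print(frame.columns.tolist())
print(frame.isna().sum().to_dict())
```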

In [None]:
repos = pd.DataFrame(data=wbuchs)
repos.sample(4)

Unnamed: 0,active_lock_reason,assignee,author_association,body,closed_at,comments,comments_url,created_at,draft,events_url,html_url,id,labels_url,locked,milestone,node_id,number,performed_via_github_app,reaction +1,reaction -1,reaction confused,reaction eyes,reaction heart,reaction hooray,reaction laugh,reaction rocket,reaction total_count,reaction url,repository_url,state,timeline_url,title,updated_at,url
4611,,,NONE,I am opening this as a separate issue since th...,,18,https://api.github.com/repos/elastic/elasticse...,2015-03-09T16:25:56Z,,https://api.github.com/repos/elastic/elasticse...,https://github.com/elastic/elasticsearch/issue...,60369836,https://api.github.com/repos/elastic/elasticse...,False,,MDU6SXNzdWU2MDM2OTgzNg==,10043,,12,0,0,0,0,0,0,0,12,https://api.github.com/repos/elastic/elasticse...,https://api.github.com/repos/elastic/elasticse...,open,https://api.github.com/repos/elastic/elasticse...,Support min_children & max_children for nested...,2020-11-08T12:30:35Z,https://api.github.com/repos/elastic/elasticse...
18612,,,CONTRIBUTOR,Resolves #388\r\n\r\nAdds convenience methods ...,,0,https://api.github.com/repos/google/gson/issue...,2021-10-03T16:43:23Z,False,https://api.github.com/repos/google/gson/issue...,https://github.com/google/gson/pull/1984,1014454854,https://api.github.com/repos/google/gson/issue...,False,,PR_kwDOAfCA984smw1P,1984,,0,0,0,0,0,0,0,0,0,https://api.github.com/repos/google/gson/issue...,https://api.github.com/repos/google/gson,open,https://api.github.com/repos/google/gson/issue...,Add `JsonWriter` methods `setEscapeNonAsciiCha...,2021-10-03T16:43:28Z,https://api.github.com/repos/google/gson/issue...
21810,,,NONE,I am trying to build a 6 node Cassandra cluste...,,1,https://api.github.com/repos/Netflix/Priam/iss...,2018-04-27T09:30:21Z,,https://api.github.com/repos/Netflix/Priam/iss...,https://github.com/Netflix/Priam/issues/675,318340035,https://api.github.com/repos/Netflix/Priam/iss...,False,,MDU6SXNzdWUzMTgzNDAwMzU=,675,,0,0,0,0,0,0,0,0,0,https://api.github.com/repos/Netflix/Priam/iss...,https://api.github.com/repos/Netflix/Priam,open,https://api.github.com/repos/Netflix/Priam/iss...,Priam bootstrap sequence,2018-06-01T04:03:37Z,https://api.github.com/repos/Netflix/Priam/iss...
11783,,,NONE,I'm using \n\n```\nFileEntity entity = new Fil...,,5,https://api.github.com/repos/android-async-htt...,2014-10-12T22:45:33Z,,https://api.github.com/repos/android-async-htt...,https://github.com/android-async-http/android-...,45596150,https://api.github.com/repos/android-async-htt...,False,,MDU6SXNzdWU0NTU5NjE1MA==,711,,0,0,0,0,0,0,0,0,0,https://api.github.com/repos/android-async-htt...,https://api.github.com/repos/android-async-htt...,open,https://api.github.com/repos/android-async-htt...,onProgress not fired on put request,2017-08-26T12:20:47Z,https://api.github.com/repos/android-async-htt...


In [None]:
repos.shape

(34268, 34)

In [None]:
repos.to_csv("repos.csv")
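
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again with `pd.read_csv`. The sketch below shows the difference on a small made-up frame, using an in-memory buffer instead of a file. Whether the index should be kept for `repos.csv` is a choice the notebook does not state.

```python
import io

import pandas as pd

# Small illustrative frame, not the extracted data.
frame = pd.DataFrame({"number": [675, 711], "state": ["open", "open"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```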