### Extracting data from GitHub repositories using PyDriller

In this notebook we use the [PyDriller framework](https://pydriller.readthedocs.io/en/latest/intro.html) to extract information from GitHub repositories for analytics purposes. Repository data such as commits, developers, modifications, diffs, and source code can be exported quickly and programmatically to local CSV files or ingested into cloud storage such as S3.

In [13]:
import logging
import boto3
import os

import pandas as pd
from pydriller import Repository, Git, Commit

#### Setting variables

In [42]:

# Git configurations
REPO_SLUG = "fvgm-spec/medium_notebooks"
REPO_URL = f"https://github.com/{REPO_SLUG}.git"

# S3 bucket configurations
BUCKET_PATH = f"s3://sample-data9623/github-analytics/{REPO_SLUG}/commits"

# Local path
DATA_PATH = "output_data"

In [20]:
# Open a local clone of the repository (this path is machine-specific)
gr = Git("D:/dev/fvgm-spec/medium_notebooks")

In [26]:
commits = gr.get_list_commits()
commits  

<generator object Git.get_list_commits at 0x00000168CDA2B880>
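The output above shows that `get_list_commits()` returns a generator, so no commit is read until something iterates over it. A minimal sketch of one way to look at the first few items follows; `peek` and `fake_commits` are hypothetical names introduced here, and the stand-in generator is used so the example runs without a repository.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def peek(items: Iterable[T], n: int = 5) -> List[T]:
    """Return at most the first `n` items, pulling only those from the iterator."""
    return list(islice(items, n))


def fake_commits() -> Iterator[str]:
    """Stand-in for gr.get_list_commits(); yields made-up commit labels."""
    for i in range(100):
        yield f"commit-{i}"


first_three = peek(fake_commits(), 3)
print(first_three)
```

With the notebook's object, `peek(gr.get_list_commits(), 3)` would presumably return the first three `Commit` objects. Keep in mind that a generator can only be iterated once, so wrap it in `list(...)` if the commits are needed again.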

#### Helper functions

In [51]:
def download_commits(url: str = REPO_URL) -> list:
    """Return one dict of metadata per commit in the repository at `url`."""
    commits = []
    
    # Traverse the commits in the repository
    for commit in Repository(url).traverse_commits():
        commits.append(
            {
                "commit_hash": commit.hash,
                "commit_msg": commit.msg,
                "author_name": commit.author.name,
                "author_email": commit.author.email,
                "author_date": commit.author_date,
                "author_timezone": commit.author_timezone,
                "merge": commit.merge,
            }
        )

    return commits

def download_modified_files(url: str = REPO_URL) -> list:
    """Return one dict per file touched by each commit in the repository at `url`."""
    modified_files = []

    # Traverse the commits in the repository
    for commit in Repository(url).traverse_commits():

        # Iterate over the modified files in each commit
        for modified_file in commit.modified_files:
            modified_files.append(
                {
                    "commit_hash": commit.hash,
                    "filename": modified_file.filename,
                    "old_path": modified_file.old_path,
                    "new_path": modified_file.new_path,
                    "added_lines": modified_file.added_lines,
                    "deleted_lines": modified_file.deleted_lines,
                }
            )

    return modified_files

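def write_commits(commit_list: list,
                  destination: str,
                  filename: str) -> pd.DataFrame:
    """Write the records to a timestamped CSV under `destination` and return the DataFrame."""
    df = pd.DataFrame(commit_list)
    # Timestamp suffix keeps successive exports from overwriting each other
    timestamp = pd.Timestamp.now().strftime("%Y%m%d%H%M%S")
    # index=False avoids an extra "Unnamed: 0" column when the CSV is read back
    df.to_csv(f"{destination}/{filename}_{timestamp}.csv", index=False)

    return df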
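`write_commits` accepts either a local folder such as `DATA_PATH` or an `s3://` URI such as `BUCKET_PATH`. A local folder has to exist before `to_csv` is called, while writing to S3 through pandas relies on the optional `s3fs` package. The sketch below shows one possible way to prepare the destination first; `prepare_destination` is a hypothetical helper, not part of the original notebook.

```python
import os


def prepare_destination(destination: str) -> str:
    """Create a local output directory if needed; leave s3:// URIs untouched."""
    if destination.startswith("s3://"):
        # Remote URI: nothing to create locally. Only a trailing slash is
        # stripped so that joining with "/" does not produce "//".
        return destination.rstrip("/")
    # Local path: create the folder (and any parents) if it is missing
    os.makedirs(destination, exist_ok=True)
    return destination
```

A call such as `write_commits(commits, prepare_destination(DATA_PATH), "commits")` would then work on a fresh checkout where `output_data` has not been created yet.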

In [46]:
commits = download_commits()
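Each record stores `author_timezone` as an integer. In the table printed further down it is `10800` next to dates carrying a `-03:00` offset, which is consistent with the value being expressed in seconds west of UTC. Under that assumption, the sketch below converts it into a standard `datetime.timezone`; `to_tzinfo` is a hypothetical helper.

```python
from datetime import timedelta, timezone


def to_tzinfo(seconds_west_of_utc: int) -> timezone:
    """Convert an offset in seconds west of UTC into a datetime.timezone."""
    # Offsets west of UTC are positive in this convention, so the sign flips
    return timezone(timedelta(seconds=-seconds_west_of_utc))


print(to_tzinfo(10800))  # UTC-03:00
```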

In [52]:
modified_files = download_modified_files()
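The `modified_files` list holds one record per file per commit, which makes simple aggregations straightforward. As an illustration, the sketch below sums added and deleted lines per filename using only the standard library; `churn_per_file` is a hypothetical helper and the sample rows are made up, using the same keys that `download_modified_files()` produces.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def churn_per_file(modified_files: Iterable[Mapping]) -> Dict[str, int]:
    """Sum added and deleted lines per filename across all commits."""
    churn: Dict[str, int] = defaultdict(int)
    for row in modified_files:
        churn[row["filename"]] += row["added_lines"] + row["deleted_lines"]
    return dict(churn)


# Made-up rows shaped like the output of download_modified_files()
sample = [
    {"commit_hash": "a1", "filename": "README.md", "added_lines": 10, "deleted_lines": 0},
    {"commit_hash": "b2", "filename": "notebook.ipynb", "added_lines": 120, "deleted_lines": 4},
    {"commit_hash": "c3", "filename": "README.md", "added_lines": 3, "deleted_lines": 2},
]
print(churn_per_file(sample))  # {'README.md': 15, 'notebook.ipynb': 124}
```

Because `filename` is only the base name, files with the same name in different folders are merged; grouping by `new_path` instead would keep them apart.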

In [50]:
write_commits(commits, BUCKET_PATH, "commits")

Unnamed: 0,commit_hash,commit_msg,author_name,author_email,author_date,author_timezone,merge
0,ace05ee8918dba8f6841d313769029256684f2b4,adding README,felix-bla,felixgutierrez@ballastlane.com,2022-06-23 08:58:01-03:00,10800,False
1,ed7a068435e461427a96e07d7ba9e5bd09c1e41a,adding notebook,felix-bla,felixgutierrez@ballastlane.com,2022-06-23 09:00:21-03:00,10800,False
2,d831ea0014d1a110349635d00248c1b4d00285ae,adding file,felix-bla,felixgutierrez@ballastlane.com,2022-06-23 09:01:23-03:00,10800,False
3,57a4014a93a97ecd918ce8fd49879ac0ec959825,adding file,felix-bla,felixgutierrez@ballastlane.com,2022-06-23 09:02:09-03:00,10800,False
4,050f7c376c623eb5c8bd6ce99fe4248bc06c51a3,adding file,felix-bla,felixgutierrez@ballastlane.com,2022-06-23 09:03:23-03:00,10800,False
...,...,...,...,...,...,...,...
56,127b1a8ed2bf2c92b5abdcf02a0aca234c6340a1,updating notebook,Felix Gutierrez,felixgutierrez@ballastlane.com,2023-05-08 13:57:49-03:00,10800,False
57,2e2ca3a3b07c7f977bc3eac9547490069cfb29f0,Created using Colaboratory,Felix Gutierrez,60470663+fvgm-spec@users.noreply.github.com,2023-05-17 13:39:18-03:00,10800,False
58,40aa4601238b3b230e463b3774c60368163d73aa,adding notebook,Felix Gutierrez,felixgutierrez@ballastlane.com,2023-05-17 13:46:56-03:00,10800,False
59,a05fca0b43f4e9545e8a9fd014d84941b8d244fd,adding notebook,Felix Gutierrez,felixgutierrez@ballastlane.com,2023-05-17 13:47:44-03:00,10800,False
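The table lists the same email address under two author names (`felix-bla` and `Felix Gutierrez`), so counting commits by name would split one person into two. A minimal sketch of counting by email instead follows; `commits_per_author` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter
from typing import Iterable, Mapping


def commits_per_author(commits: Iterable[Mapping]) -> Counter:
    """Count commits per author, keyed by lower-cased email address."""
    return Counter(row["author_email"].lower() for row in commits)


# Made-up rows shaped like the output of download_commits()
sample = [
    {"author_name": "dev-a", "author_email": "dev@example.com"},
    {"author_name": "Dev A", "author_email": "Dev@example.com"},
    {"author_name": "Dev B", "author_email": "other@example.com"},
]
print(commits_per_author(sample).most_common())
# [('dev@example.com', 2), ('other@example.com', 1)]
```

Commits made through a GitHub `noreply` address, like row 57 above, would still be counted separately under this approach.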


In [58]:
for commit in Repository(REPO_URL).traverse_commits():
    # Print hash, project, message, committer name and commit date, space-separated
    print(
        commit.hash,
        commit.project_name,
        commit.msg,
        commit.committer.name,
        commit.committer_date,
    )

ace05ee8918dba8f6841d313769029256684f2b4 medium_notebooks adding README felix-bla 2022-06-23 08:58:01-03:00
ed7a068435e461427a96e07d7ba9e5bd09c1e41a medium_notebooks adding notebook felix-bla 2022-06-23 09:00:21-03:00
d831ea0014d1a110349635d00248c1b4d00285ae medium_notebooks adding file felix-bla 2022-06-23 09:01:23-03:00
57a4014a93a97ecd918ce8fd49879ac0ec959825 medium_notebooks adding file felix-bla 2022-06-23 09:02:09-03:00
050f7c376c623eb5c8bd6ce99fe4248bc06c51a3 medium_notebooks adding file felix-bla 2022-06-23 09:03:23-03:00
09e4c841dc7b926d85be7194dbcf3feb6a459677 medium_notebooks adding new file felix-bla 2022-06-23 09:30:51-03:00
d094952e383b4193e99d55812bb597efaa03b449 medium_notebooks adding new file felix-bla 2022-06-23 09:34:25-03:00
db7528429a97b156e25214d061d93b857edf8b79 medium_notebooks adding new notebook felix-bla 2022-06-23 09:36:50-03:00
f7e16bf0ce082f46fb22188c295397061ded770a medium_notebooks updating notebook felix-bla 2022-06-25 20:25:52-03:00
2e117581a59c94d3aa

#### Git is a wrapper
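The notebook opens a local clone with `Git(...)` and asks it for its commits. A minimal sketch of collecting a few more facts from such an object follows. It assumes the `get_head()` and `total_commits()` methods of PyDriller's `Git` class; `summarize_repo` is a hypothetical helper, and a stand-in object with made-up values is used so the example runs without a repository.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_repo(git_repo: Any) -> Dict[str, Any]:
    """Collect a few facts from a PyDriller-style Git object."""
    head = git_repo.get_head()  # Commit object for the current HEAD
    return {
        "head_hash": head.hash,
        "head_msg": head.msg,
        "total_commits": git_repo.total_commits(),
    }


# Stand-in mimicking the two Git methods used above (values are made up)
fake_git = SimpleNamespace(
    get_head=lambda: SimpleNamespace(hash="abc123", msg="adding notebook"),
    total_commits=lambda: 60,
)
print(summarize_repo(fake_git))
```

With the notebook's object, the call would be `summarize_repo(gr)`.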

#### Repository Class
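`Repository` is the entry point used throughout this notebook, always with just a URL. It also accepts optional filters that narrow down which commits `traverse_commits()` yields. The sketch below builds such keyword arguments for the `since`, `to`, `only_modifications_with_file_types`, and `only_no_merge` parameters; `build_filters` is a hypothetical helper, and the `Repository` call is left as a comment so the example runs without PyDriller or network access.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def build_filters(
    since: Optional[datetime] = None,
    to: Optional[datetime] = None,
    file_types: Optional[List[str]] = None,
    skip_merges: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for pydriller.Repository, omitting unset filters."""
    filters: Dict[str, Any] = {}
    if since is not None:
        filters["since"] = since
    if to is not None:
        filters["to"] = to
    if file_types:
        filters["only_modifications_with_file_types"] = file_types
    if skip_merges:
        filters["only_no_merge"] = True
    return filters


filters = build_filters(since=datetime(2023, 1, 1), file_types=[".ipynb"], skip_merges=True)
print(filters)
# With PyDriller installed, the filters would be applied like this:
# for commit in Repository(REPO_URL, **filters).traverse_commits(): ...
```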

#### Commit object
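The helper functions above read identity fields such as `hash`, `msg`, and `author` from each commit. A `Commit` also exposes size-related attributes. The sketch below assumes the `insertions`, `deletions`, `files`, and `parents` attributes of PyDriller's `Commit` class; `commit_stats` is a hypothetical helper, and a stand-in object with made-up values is used so the example runs without a repository.

```python
from types import SimpleNamespace
from typing import Any, Dict


def commit_stats(commit: Any) -> Dict[str, Any]:
    """Extract size-related attributes from a PyDriller-style Commit object."""
    return {
        "commit_hash": commit.hash,
        "insertions": commit.insertions,
        "deletions": commit.deletions,
        "lines": commit.insertions + commit.deletions,
        "files": commit.files,
        "n_parents": len(commit.parents),  # two or more parents means a merge
    }


# Stand-in commit with made-up values
fake_commit = SimpleNamespace(
    hash="abc123", insertions=12, deletions=3, files=2, parents=["def456"]
)
print(commit_stats(fake_commit))
```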

#### List of modified files
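Each commit carries a `modified_files` list, which `download_modified_files()` flattens into rows. Besides paths and line counts, every entry has a `change_type` describing what happened to the file. The sketch below assumes that `change_type` is an enum-like value with a `name` such as `ADD`, `MODIFY`, `DELETE`, or `RENAME`; `count_change_types` is a hypothetical helper and the stand-in objects are made up.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_change_types(modified_files: Iterable[Any]) -> Counter:
    """Count modified files by the name of their change_type."""
    return Counter(mf.change_type.name for mf in modified_files)


def fake_file(filename: str, change: str) -> SimpleNamespace:
    """Build a stand-in for a PyDriller ModifiedFile (made-up data)."""
    return SimpleNamespace(filename=filename, change_type=SimpleNamespace(name=change))


sample_files = [
    fake_file("README.md", "ADD"),
    fake_file("notebook.ipynb", "ADD"),
    fake_file("README.md", "MODIFY"),
]
print(count_change_types(sample_files))
```

With real data, `count_change_types(commit.modified_files)` would presumably summarize a single commit in the same way.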