### Extracting data from GitHub repositories using PyDriller

In this notebook we use the [PyDriller framework](https://pydriller.readthedocs.io/en/latest/intro.html) to extract information from GitHub repositories for analytics purposes. Repository data such as commits, developers, modifications, diffs, and source code can be quickly and programmatically exported to local CSV files or ingested into cloud storage such as S3.

In [23]:
import logging
import boto3
import os

import pandas as pd
import duckdb
from pydriller import Repository, Git, Commit

#### Setting variables

In [24]:
# Git configurations
REPO_SLUG = "fvgm-spec/my_test_repo"
REPO_URL = f"https://github.com/{REPO_SLUG}.git"
#https://github.com/fvgm-spec/medium_notebooks/my_test_repo.git
# S3 bucket configurations
BUCKET_PATH = f"s3://sample-data9623/github-analytics/{REPO_SLUG}/commits"

# Local path
DATA_PATH = "output_data"

#Setting connection with DuckDB
conn = duckdb.connect()

#### Helper functions

In [19]:
def download_commits(url:str) -> list:
    commits = []
    
    # Traverse the commits in the repository
    for commit in Repository(url).traverse_commits():
        commits.append(
            {
                "commit_hash": commit.hash,
                "commit_msg": commit.msg,
                "author_name": commit.author.name,
                "author_email": commit.author.email,
                "author_date": commit.author_date,
                "author_timezone": commit.author_timezone,
                "merge": commit.merge,
            }
        )

    return commits

def download_modified_files(url:str=REPO_URL) -> list:
    
    modified_files = []

    # Traverse the commits in the repository
    for commit in Repository(url).traverse_commits():

        # Iterate over the modified files in each commit
        for modified_file in commit.modified_files:
            modified_files.append(
                {
                    "commit_hash": commit.hash,
                    "filename": modified_file.filename,
                    "old_path": modified_file.old_path,
                    "new_path": modified_file.new_path,
                    "added_lines": modified_file.added_lines,
                    "deleted_lines": modified_file.deleted_lines,
                }
            )

    return modified_files

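def write_commits(commit_list:list,
                  destination:str,
                  filename:str) -> pd.DataFrame:
    """Write the records to a timestamped CSV file and return them as a DataFrame.

    `destination` is a local directory or an s3:// prefix (pandas needs the
    s3fs package installed to write to S3).
    """
    df = pd.DataFrame(commit_list)
    # A timestamp suffix keeps successive exports from overwriting each other
    timestamp = pd.Timestamp.now().strftime("%Y%m%d%H%M%S")
    df.to_csv(f"{destination}/{filename}_{timestamp}.csv")

    return df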
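
`download_modified_files` returns one record per file per commit, so a per-commit total needs one more aggregation step. The following is a minimal sketch of that step using only the standard library. The `summarize_churn` name and the sample records are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def summarize_churn(modified_files: list) -> dict:
    """Sum the file count and added/deleted lines per commit hash."""
    totals = defaultdict(lambda: {"files": 0, "added_lines": 0, "deleted_lines": 0})
    for record in modified_files:
        entry = totals[record["commit_hash"]]
        entry["files"] += 1
        entry["added_lines"] += record["added_lines"]
        entry["deleted_lines"] += record["deleted_lines"]
    return dict(totals)


# Hypothetical records shaped like the output of download_modified_files()
sample_records = [
    {"commit_hash": "abc123", "filename": "README.md", "old_path": None,
     "new_path": "README.md", "added_lines": 10, "deleted_lines": 0},
    {"commit_hash": "abc123", "filename": "main.py", "old_path": "main.py",
     "new_path": "main.py", "added_lines": 5, "deleted_lines": 2},
    {"commit_hash": "def456", "filename": "main.py", "old_path": "main.py",
     "new_path": "main.py", "added_lines": 1, "deleted_lines": 1},
]

churn = summarize_churn(sample_records)
print(churn["abc123"])  # {'files': 2, 'added_lines': 15, 'deleted_lines': 2}
```

With real data, the same function could be called as `summarize_churn(download_modified_files(REPO_URL))`.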

#### Git wrapper

In [10]:
# Creating an instance of the Git wrapper pointing to the path of the local repository
gr = Git("D:/dev/fvgm-spec/my_test_repo")
commits = gr.total_commits()
commits

4

#### Repository Class and Commit object

**Repository** is the main class of PyDriller, responsible for returning the list of commits you want. One of the main advantages of using PyDriller to mine software repositories is that it is highly configurable: the options passed to `Repository` control which commits are returned.

The `Commit` object holds all the information of a Git commit, and much more.

In [22]:
for commit in Repository(REPO_URL).traverse_commits():
    print(
        'The commit {} has been modified by {}, '
        'committed by {} in date {}'.format(
            commit.hash,
            commit.author.name,
            commit.committer.name,
            commit.committer_date
        )
    )

The commit 5ded457557b06e27fd72db8deeefe60e6c4b275d has been modified by Felix Gutierrez, committed by Felix Gutierrez in date 2023-09-06 21:53:09-03:00
The commit 10c0ab2f500bc07cb47ab165cb91fc592ac85003 has been modified by Felix Gutierrez, committed by Felix Gutierrez in date 2023-09-08 13:52:05-03:00
The commit b4fc85feca49da95d2d090de7285e7032d9b2680 has been modified by Felix Gutierrez, committed by Felix Gutierrez in date 2023-09-08 13:53:26-03:00
The commit fca65c2b6c6fd6a8cb45ff391f146f1794eb1cab has been modified by Felix Gutierrez, committed by Felix Gutierrez in date 2023-09-08 13:54:03-03:00
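
The loop above traverses every commit. To narrow the traversal, `Repository` accepts filter arguments such as `since`, `to`, `only_no_merge`, and `only_modifications_with_file_types`. The sketch below collects a few of them in one dictionary. The helper names, the date range, and the file type are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
from datetime import datetime


def build_filters(since: datetime, to: datetime, file_types: list) -> dict:
    """Collect keyword arguments for pydriller.Repository in one place."""
    return {
        "since": since,            # only commits after this date
        "to": to,                  # only commits up to this date
        "only_no_merge": True,     # skip merge commits
        "only_modifications_with_file_types": file_types,
    }


def iter_filtered_commits(url: str, filters: dict):
    """Yield the commits of the repository at `url` that satisfy `filters`."""
    # Imported here so that building the filters does not require PyDriller.
    from pydriller import Repository

    yield from Repository(url, **filters).traverse_commits()


# Illustrative values: the date range and the file type are assumptions.
filters = build_filters(
    since=datetime(2023, 9, 1),
    to=datetime(2023, 9, 30),
    file_types=[".py"],
)
print(sorted(filters))
```

A traversal would then look like `for commit in iter_filtered_commits(REPO_URL, filters): print(commit.hash)`. Keeping the filters in a plain dictionary makes it easy to log or reuse the exact configuration of an extraction run.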


#### Extracting data from a mid-size repo

In [35]:
#Extracting commits from DuckDB repo
DUCKDB_REPO = 'https://github.com/duckdb/duckdb'
duckdb_commits = download_commits(DUCKDB_REPO)

In [27]:
#Writing data to CSV file
write_commits(duckdb_commits, DATA_PATH, "duckdb_commits")

|  | commit_hash | commit_msg | author_name | author_email | author_date | author_timezone | merge |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ba75d81601913782d28a3878707d135319f38bdd | Working parser + initial draft of interface | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-13 14:29:24+02:00 | -7200 | False |
| 1 | 82f1559651a20e7a46890ac5127f4d580959f777 | Add partial source tree transformation of SELE... | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-16 16:33:20+02:00 | -7200 | False |
| 2 | 4bbe14a209e1d9db965b0f4f9136993ebb796d92 | Add namespaces and transform FROM statement. | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-17 02:01:28+02:00 | -7200 | False |
| 3 | 3db9596e75b3beaf1e0fe1b48b8847c6e8d3b157 | Also properly transform GROUP BY, ORDER BY, LI... | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-17 11:05:26+02:00 | -7200 | False |
| 4 | 2b8130d335ab4afef2ae641633fdcf7b5bb0bca6 | Simple catalog definition and hardcoded lineit... | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-17 16:09:17+02:00 | -7200 | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 32440 | c9c1e16ca16405cab50c4a7b34db9a6ed7abc3ce | Extensions autoloading: Add duckdb_aws and fix... | Carlo Piovesan | piovesan.carlo@gmail.com | 2023-09-07 20:05:24+02:00 | -7200 | False |
| 32441 | 64936b6658e9fa936f42789379d1510e3e8d2db9 | Bump uncovered_files.csv | Carlo Piovesan | piovesan.carlo@gmail.com | 2023-09-08 01:32:08+02:00 | -7200 | False |
| 32442 | 20cdfd4c9744c75c0b2f3e4e3d0c29da18bbed89 | Merge pull request #8839 from carlopi/tryautol... | Mark | mark.raasveldt@gmail.com | 2023-09-08 11:19:29+02:00 | -7200 | True |
| 32443 | e1503b1647a8b71a61e5fb22656eaad62a24780a | Merge pull request #8826 from szarnyasg/remove... | Mark | mark.raasveldt@gmail.com | 2023-09-08 11:21:12+02:00 | -7200 | True |
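
The `author_timezone` column can look surprising: rows stamped `+02:00` carry the value `-7200`. Judging by these rows, the offset is expressed in seconds west of UTC, so its sign is the opposite of the usual UTC offset. A small sketch of the conversion follows; the `tz_from_git_offset` helper is a hypothetical name introduced here.

```python
from datetime import datetime, timedelta, timezone


def tz_from_git_offset(offset_seconds: int) -> timezone:
    """Convert a 'seconds west of UTC' offset into a datetime.timezone."""
    return timezone(timedelta(seconds=-offset_seconds))


# Values taken from the first row of the table above
tz = tz_from_git_offset(-7200)
author_date = datetime(2018, 7, 13, 14, 29, 24, tzinfo=tz)

print(tz)                                                # UTC+02:00
print(author_date.astimezone(timezone.utc).isoformat())  # 2018-07-13T12:29:24+00:00
```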


#### Querying data with DuckDB

In [32]:
#Counting the total rows of the dataset
conn.query("""
SELECT COUNT(*)
FROM 'output_data/duckdb_commits_*.csv'
""").fetchall()

[(32445,)]

In [34]:
#Counting the number of commits per author
conn.query("""
SELECT author_name,COUNT(commit_hash)
FROM 'output_data/duckdb_commits_*.csv'
GROUP BY author_name""").fetchall()

[('Pedro Holanda', 2320),
 ('Sam Ansmink', 1254),
 ('Mark Raasveldt', 8766),
 ('Tania Bogatsch', 566),
 ('Tishj', 2635),
 ('Mark', 3020),
 ('Richard Wesley', 1797),
 ('Lindsay Wray', 164),
 ('satotake', 6),
 ('Hannes Mühleisen', 2635),
 ('Elliana May', 519),
 ('Laurens Kuiper', 2479),
 ('Josh Wills', 90),
 ('Max Gabrielsson', 311),
 ('Tom Ebergen', 318),
 ('Jacob', 29),
 ('Koki', 4),
 ('lokax', 152),
 ('huangzichun', 3),
 ('Yves', 45),
 ('DouEnergy', 82),
 ('Koki Ueha', 91),
 ('Nicholas Ursa', 4),
 ('Tmonster', 166),
 ('Jens', 62),
 ('Trent Hauck', 2),
 ('Jens-H', 22),
 ('Aleksandr Golubov', 2),
 ('Will Scullin', 18),
 ('Carlo Piovesan', 382),
 ('Mac Lockard', 3),
 ('stephaniewang', 119),
 ('ila', 1),
 ('Yannick Welsch', 14),
 ('ashish01', 3),
 ('Eero Lihavainen', 14),
 ('Pedro Ferreira WX1175653', 62),
 ('Sebastian Jaenicke', 2),
 ('Kirill Müller', 358),
 ('Chang She', 4),
 ('Jonathan Swenson', 7),
 ('rmq', 1),
 ('Chris Brain', 1),
 ('RJ Atwal', 52),
 ('Lars Verdoes', 206),
 ('Lei Xu'
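
The result lists 'Mark Raasveldt' and 'Mark' as separate authors, although the commit table shows both names with the same email address. Grouping by name therefore splits one contributor into several rows. The sketch below illustrates the effect in plain Python with `collections.Counter`, used here as a stand-in for the SQL query. The first two addresses come from the commit table, while the last row is a made-up placeholder.

```python
from collections import Counter

# (author_name, author_email) pairs; the last entry is a made-up placeholder.
rows = [
    ("Mark Raasveldt", "mark.raasveldt@gmail.com"),
    ("Mark", "mark.raasveldt@gmail.com"),
    ("Mark", "mark.raasveldt@gmail.com"),
    ("Carlo Piovesan", "piovesan.carlo@gmail.com"),
    ("Example Author", "author@example.com"),
]

# Counting by name keeps the two spellings apart
by_name = Counter(name for name, _ in rows)
# Counting by lower-cased email merges them into one contributor
by_email = Counter(email.lower() for _, email in rows)

print(by_name.most_common())
print(by_email.most_common())
```

In the DuckDB query, the analogous change would be to group by `author_email` instead of `author_name`. This still misses contributors who commit from several addresses.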

#### Storing data in S3

In [40]:
# S3 bucket configurations
REPO_SLUG = "duckdb/duckdb"
BUCKET_PATH = f"s3://sample-data9623/github-analytics/{REPO_SLUG}/commits"

#Writing data to S3 bucket
write_commits(duckdb_commits, BUCKET_PATH, "duckdb_commits")

|  | commit_hash | commit_msg | author_name | author_email | author_date | author_timezone | merge |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ba75d81601913782d28a3878707d135319f38bdd | Working parser + initial draft of interface | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-13 14:29:24+02:00 | -7200 | False |
| 1 | 82f1559651a20e7a46890ac5127f4d580959f777 | Add partial source tree transformation of SELE... | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-16 16:33:20+02:00 | -7200 | False |
| 2 | 4bbe14a209e1d9db965b0f4f9136993ebb796d92 | Add namespaces and transform FROM statement. | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-17 02:01:28+02:00 | -7200 | False |
| 3 | 3db9596e75b3beaf1e0fe1b48b8847c6e8d3b157 | Also properly transform GROUP BY, ORDER BY, LI... | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-17 11:05:26+02:00 | -7200 | False |
| 4 | 2b8130d335ab4afef2ae641633fdcf7b5bb0bca6 | Simple catalog definition and hardcoded lineit... | Mark Raasveldt | mark.raasveldt@gmail.com | 2018-07-17 16:09:17+02:00 | -7200 | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 32440 | c9c1e16ca16405cab50c4a7b34db9a6ed7abc3ce | Extensions autoloading: Add duckdb_aws and fix... | Carlo Piovesan | piovesan.carlo@gmail.com | 2023-09-07 20:05:24+02:00 | -7200 | False |
| 32441 | 64936b6658e9fa936f42789379d1510e3e8d2db9 | Bump uncovered_files.csv | Carlo Piovesan | piovesan.carlo@gmail.com | 2023-09-08 01:32:08+02:00 | -7200 | False |
| 32442 | 20cdfd4c9744c75c0b2f3e4e3d0c29da18bbed89 | Merge pull request #8839 from carlopi/tryautol... | Mark | mark.raasveldt@gmail.com | 2023-09-08 11:19:29+02:00 | -7200 | True |
| 32443 | e1503b1647a8b71a61e5fb22656eaad62a24780a | Merge pull request #8826 from szarnyasg/remove... | Mark | mark.raasveldt@gmail.com | 2023-09-08 11:21:12+02:00 | -7200 | True |
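
`write_commits` hands the `s3://` path straight to `DataFrame.to_csv`, which pandas resolves through fsspec and which requires the `s3fs` package. Where that dependency is not available, one possible alternative is to upload the CSV text with `boto3` directly. The following is only a sketch under that assumption: `split_s3_uri` and `upload_csv_text` are hypothetical helpers, and the upload presumes that AWS credentials are already configured.

```python
from datetime import datetime
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.strip("/")


def upload_csv_text(csv_text: str, destination: str, filename: str) -> str:
    """Upload CSV text under a timestamped key and return that key."""
    # Imported here so the URI helper works without boto3 installed.
    import boto3

    bucket, prefix = split_s3_uri(destination)
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    key = f"{prefix}/{filename}_{timestamp}.csv"
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=csv_text.encode("utf-8")
    )
    return key


bucket, prefix = split_s3_uri(
    "s3://sample-data9623/github-analytics/duckdb/duckdb/commits"
)
print(bucket)  # sample-data9623
print(prefix)  # github-analytics/duckdb/duckdb/commits
```

An upload could then be issued as `upload_csv_text(pd.DataFrame(duckdb_commits).to_csv(index=False), BUCKET_PATH, "duckdb_commits")`, since `to_csv` returns the CSV as a string when no path is given.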
