## Data analysis to address the following queries on Git commits:

1. Determine the top 5 committers ranked by commit count, along with their number of commits
2. Determine the committer with the longest commit streak
3. Generate a heatmap of commit counts across all users by day of week and by 3-hour block

Each commit carries two user roles, an author and a committer. This analysis uses the committer.

In [7]:
# automatically reload imported modules whenever their source code changes
%load_ext autoreload
%reload_ext autoreload
%autoreload 2

import sqlite3
import pandas as pd
import datetime as dt
from datetime import datetime, timezone
from pandas.io.formats import style

import common as comm

local_timezone = 'Asia/Singapore'
ordered_weekday = [ 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
three_hourly_blocks = [
    ('00-03', '00:00:00' , '02:59:00'),
    ('03-06', '03:00:00' , '05:59:00'),
    ('06-09', '06:00:00' , '08:59:00'),
    ('09-12', '09:00:00' , '11:59:00'),
    ('12-15', '12:00:00' , '14:59:00'),
    ('15-18', '15:00:00' , '17:59:00'),
    ('18-21', '18:00:00' , '20:59:00'),
    ('21-00', '21:00:00' , '23:59:00'),
]


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [8]:
# read data
conn = sqlite3.connect("github_api.db")
cursor = conn.cursor()
cursor.execute("""
    SELECT * 
    FROM commits_history
""")



df = pd.DataFrame(cursor.fetchall(), columns=['committer_name', 'committer_email', 'commit_datetime', 'commit_url'])
print(f"df.shape: {df.shape}")

df.shape: (1400, 4)
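
As an alternative to fetching rows through a cursor and naming the columns by hand, the load step could be sketched with `pd.read_sql_query`. This is only a sketch: the in-memory database, the `commits_history` schema, and the two rows below are assumptions made for illustration, not the contents of `github_api.db`.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for github_api.db; schema and rows are assumed for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE commits_history ("
    "committer_name TEXT, committer_email TEXT, "
    "commit_datetime TEXT, commit_url TEXT)"
)
demo_conn.executemany(
    "INSERT INTO commits_history VALUES (?, ?, ?, ?)",
    [
        ("Alice", "alice@example.com", "2022-03-02T11:29:07.000+01:00", "https://example.com/1"),
        ("Bob", "bob@example.com", "2022-03-03T09:00:00.000+01:00", "https://example.com/2"),
    ],
)

# read_sql_query takes the column names from the query itself, so listing them
# in the SELECT avoids depending on the table's physical column order.
demo_df = pd.read_sql_query(
    "SELECT committer_name, committer_email, commit_datetime, commit_url "
    "FROM commits_history",
    demo_conn,
)
demo_conn.close()
print(f"demo_df.shape: {demo_df.shape}")
```

Naming the columns in the query keeps the DataFrame labels tied to the data even if the table definition changes later.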


### Issue 1 - Determine the top 5 committers ranked by count of commits and their number of commits

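'GitHub' is the most prolific committer from Mar to Aug 2022, accounting for 1,365 of the 1,400 commits to the Apache Airflow public repo in this dataset. Only one other committer name, Jarek Potiuk, appears (35 commits), so the top-5 list has just two entries.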

In [9]:
issue_1_df = df.copy()
(issue_1_df[['committer_name', 'commit_datetime']]
    .groupby(['committer_name'])
    .count()
    .sort_values(['commit_datetime'], ascending=False)
    .head())

| committer_name | commit_datetime |
|---|---|
| GitHub | 1365 |
| Jarek Potiuk | 35 |
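
The same ranking could also be sketched with `value_counts`, which counts and sorts in one step. The toy names below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data; names and counts are invented for illustration.
toy_df = pd.DataFrame(
    {"committer_name": ["alice", "bob", "alice", "carol", "alice", "bob"]}
)

# value_counts sorts by frequency in descending order; head(5) keeps at most five rows.
top_committers = toy_df["committer_name"].value_counts().head(5)
print(top_committers)
```

When fewer than five distinct committers exist, `head(5)` simply returns all of them, which matches the two-row result above.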


### Issue 2 - Determine the committer with the longest streak

Committer 'GitHub' has the longest commit streak of 3 days for the period analyzed.

By definition, a commit streak must extend beyond one day.

The seek_consecutive_dates() function returns runs of dates that are in consecutive order. For example, "'Range5': ('2022-03-02', '2022-03-03')," means there are commits on both 2022-03-02 and 2022-03-03, which makes a commit streak of 2 days.

A single day can contain several commits, e.g., 7 commits on "2022-03-05".
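
The streak definition above can be illustrated with a small standalone function. This is a sketch under assumptions: `longest_streak` is a hypothetical helper written for this example, not the implementation in the `common` module, and the toy dates are invented.

```python
from datetime import date, timedelta


def longest_streak(days):
    """Return (start, end, length) of the longest run of consecutive calendar days."""
    unique_days = sorted(set(days))  # several commits on one day count once
    if not unique_days:
        return None
    best_start = run_start = unique_days[0]
    best_len = run_len = 1
    for prev, cur in zip(unique_days, unique_days[1:]):
        if cur - prev == timedelta(days=1):
            run_len += 1  # the run continues
        else:
            run_start, run_len = cur, 1  # a gap starts a new run
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_start + timedelta(days=best_len - 1), best_len


# Toy commit dates: two commits on 2 March, then runs of 2 and 3 days.
toy_days = [
    date(2022, 3, 2), date(2022, 3, 2), date(2022, 3, 3),
    date(2022, 3, 8), date(2022, 3, 9), date(2022, 3, 10),
]
print(longest_streak(toy_days))
```

A returned length of 1 means no day was followed by another commit day, which under the definition above is not a streak.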

In [10]:
issue_2_df = comm.localize_timestamp_to_local_timezone(df.copy(), 'commit_datetime', local_timezone)
comm.seek_consecutive_dates(issue_2_df['utc_dt_isoformat'])

{'Range1': ('2022-03-02',),
 'Range2': ('2022-03-02',),
 'Range3': ('2022-03-02',),
 'Range4': ('2022-03-02',),
 'Range5': ('2022-03-02', '2022-03-03'),
 'Range6': ('2022-03-03',),
 'Range7': ('2022-03-03',),
 'Range8': ('2022-03-03',),
 'Range9': ('2022-03-03',),
 'Range10': ('2022-03-03',),
 'Range11': ('2022-03-03',),
 'Range12': ('2022-03-03',),
 'Range13': ('2022-03-03',),
 'Range14': ('2022-03-08',),
 'Range15': ('2022-03-08',),
 'Range16': ('2022-03-08',),
 'Range17': ('2022-03-08',),
 'Range18': ('2022-03-08',),
 'Range19': ('2022-03-08', '2022-03-09'),
 'Range20': ('2022-03-09',),
 'Range21': ('2022-03-09',),
 'Range22': ('2022-03-09',),
 'Range23': ('2022-03-09',),
 'Range24': ('2022-03-09',),
 'Range25': ('2022-03-09',),
 'Range26': ('2022-03-09',),
 'Range27': ('2022-03-09',),
 'Range28': ('2022-03-09',),
 'Range29': ('2022-03-09',),
 'Range30': ('2022-03-09',),
 'Range31': ('2022-03-09',),
 'Range32': ('2022-03-09',),
 'Range33': ('2022-03-09',),
 'Range34': ('2022-03-09',

In [11]:
# group the rows into one DataFrame per committer
issue_2_df_by_committer = issue_2_df.groupby('committer_name')
for committer, dataframe in issue_2_df_by_committer:
    print(f"First 2 entries for {committer!r}")
    print("-----------------------------------")
    print(dataframe.head(2), end="\n\n")

First 2 entries for 'GitHub'
-----------------------------------
   committer_name     committer_email                commit_datetime  \
86         GitHub  noreply@github.com  2022-03-02T11:29:07.000+01:00   
87         GitHub  noreply@github.com  2022-03-02T11:15:36.000+01:00   

                                           commit_url  \
86  https://api.github.com/repos/apache/airflow/gi...   
87  https://api.github.com/repos/apache/airflow/gi...   

             utc_dt_local_tz utc_dt_isoformat locale_date  
86 2022-03-02 18:29:07+08:00       2022-03-02    03/02/22  
87 2022-03-02 18:15:36+08:00       2022-03-02    03/02/22  

First 2 entries for 'Jarek Potiuk'
-----------------------------------
    committer_name   committer_email                commit_datetime  \
442   Jarek Potiuk  jarek@potiuk.com  2022-04-25T17:09:00.000+02:00   
441   Jarek Potiuk  jarek@potiuk.com  2022-04-25T15:08:29.000+02:00   

                                            commit_url  \
442  https://api.githu
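
The localized columns in the output above could be derived roughly as follows. This is a sketch that assumes plain pandas calls; it is not the code of `comm.localize_timestamp_to_local_timezone`, which is not shown here. The two timestamps are the `commit_datetime` values printed for 'GitHub'.

```python
import pandas as pd

raw = pd.Series(
    ["2022-03-02T11:29:07.000+01:00", "2022-03-02T11:15:36.000+01:00"]
)

# utc=True normalises the source offsets to UTC before converting to the target zone.
local = pd.to_datetime(raw, utc=True).dt.tz_convert("Asia/Singapore")

# Calendar date in the local timezone, as an ISO-format string.
local_dates = local.dt.strftime("%Y-%m-%d")

print(local.iloc[0])  # 2022-03-02 18:29:07+08:00
```

Converting before extracting the date matters for streaks, because a commit late in the evening in one timezone can fall on the next calendar day in another.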

In [12]:
# compare the longest streak of each committer
for committer, dataframe in issue_2_df_by_committer:
    print(f"Commit streak for {committer!r}")    
    print("-----------------------------------")
    consecutive_dates = comm.seek_consecutive_dates(dataframe['utc_dt_isoformat'])
    longest_streak = comm.max_len_commit_streak(consecutive_dates)
    print(f"Longest streak: {longest_streak}")
    print()

Commit streak for 'GitHub'
-----------------------------------
Longest streak: (('2022-04-07', '2022-04-08', '2022-04-09'), 3)

Commit streak for 'Jarek Potiuk'
-----------------------------------
Longest streak: (('2022-04-25', '2022-04-26'), 2)



### Issue 3 - Generate a heatmap of commit counts across all users by day of week and by 3-hour block

Judging by the colour gradient in the heatmap below, commits are most frequent between 9pm and 6am (times are in the Asia/Singapore timezone set in local_timezone), and the single busiest cell is Tuesday 21-00 with 68 commits. Across the week, commits occur most frequently on Tue/Wed.

The fewest commits occur between 9am and noon.

In [13]:
(comm.df_for_heatmap(df, local_timezone, three_hourly_blocks, ordered_weekday)
    .pivot_table(index='day_of_week', columns='time_block', values=['utc_dt_local_tz'], aggfunc='count')
    .style.background_gradient(axis=None))

Count of utc_dt_local_tz by day_of_week (rows) and time_block (columns); the first data column has a blank time_block label in the output:

| day_of_week | (blank) | 00-03 | 03-06 | 06-09 | 09-12 | 12-15 | 15-18 | 18-21 | 21-00 |
|---|---|---|---|---|---|---|---|---|---|
| Mon | 1 | 14 | 19 | 18 | 6 | 6 | 32 | 35 | 35 |
| Tue | 3 | 37 | 50 | 18 | 7 | 21 | 44 | 44 | 68 |
| Wed | 1 | 52 | 48 | 22 | 5 | 13 | 30 | 39 | 47 |
| Thu | 1 | 36 | 54 | 18 | 10 | 10 | 14 | 31 | 28 |
| Fri | 1 | 45 | 34 | 13 | 8 | 8 | 17 | 34 | 42 |
| Sat | 0 | 44 | 30 | 24 | 16 | 2 | 13 | 13 | 11 |
| Sun | 0 | 8 | 24 | 3 | 1 | 4 | 22 | 23 | 25 |
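
A day-of-week by 3-hour-block count table like the one above could be built along these lines. This is a sketch under assumptions: it does not reproduce `comm.df_for_heatmap`, whose code is not shown, and it assigns blocks by integer division of the hour rather than by start and end times. The toy timestamps are invented and treated as already being in local time.

```python
import pandas as pd

weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
block_labels = ["00-03", "03-06", "06-09", "09-12", "12-15", "15-18", "18-21", "21-00"]

# Toy timestamps, invented for illustration (7 March 2022 was a Monday).
ts = pd.Series(pd.to_datetime([
    "2022-03-07 02:59:30",  # Mon, last minute of the 00-03 block
    "2022-03-07 22:10:00",  # Mon, 21-00
    "2022-03-08 22:45:00",  # Tue, 21-00
    "2022-03-08 23:05:00",  # Tue, 21-00
]))

toy = pd.DataFrame({
    # dayofweek is 0 for Monday, so it indexes straight into the ordered list.
    "day_of_week": [weekdays[d] for d in ts.dt.dayofweek],
    # hour // 3 maps every hour 0..23 onto exactly one of the eight blocks.
    "time_block": [block_labels[h // 3] for h in ts.dt.hour],
})

counts = (
    toy.groupby(["day_of_week", "time_block"]).size()
    .unstack(fill_value=0)
    # reindex restores the full grid in display order, with zeros for empty cells.
    .reindex(index=weekdays, columns=block_labels, fill_value=0)
)
print(counts)
```

With integer division every timestamp lands in a labelled block, including one such as 02:59:30. In a notebook, `counts.style.background_gradient(axis=None)` would colour the grid as in the cell above.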
