## Data Analysis to address the following queries on Git commits:

1. Determine the top 5 committers ranked by commit count, along with each committer's count
2. Determine the committer with the longest streak
3. Generate a heatmap of commit counts across all users by day of week and 3-hour block

Git records two identities for each commit: the author, who wrote the change, and the committer, who applied it to the repository. This analysis uses the committer.
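
To make the author/committer distinction concrete, here is a minimal sketch that extracts the committer fields from one commit record. It assumes the record has the shape returned by the GitHub REST API's commit listing, where `commit.author` and `commit.committer` each carry `name`, `email`, and `date`. The sample values and the `extract_committer` helper are hypothetical and are not part of the `common` module.

```python
# Hypothetical sample record, shaped like one item from the GitHub REST API
# commit listing (values are made up for illustration).
sample_commit = {
    "commit": {
        "author": {
            "name": "Jane Doe",
            "email": "jane@example.com",
            "date": "2022-03-01T14:20:00Z",
        },
        "committer": {
            "name": "GitHub",
            "email": "noreply@github.com",
            "date": "2022-03-01T14:28:28Z",
        },
    },
    "html_url": "https://example.com/commit/abc123",
}


def extract_committer(record: dict) -> dict:
    """Return the committer fields used in this analysis (hypothetical helper)."""
    committer = record["commit"]["committer"]
    return {
        "committer_name": committer["name"],
        "committer_email": committer["email"],
        "commit_datetime": committer["date"],
        "commit_url": record["html_url"],
    }


row = extract_committer(sample_commit)
print(row["committer_name"], row["commit_datetime"])
```

Note that the author and committer can differ for the same commit, which is why the choice between them affects the rankings.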

In [1]:
# automatically reload imported modules whenever their source code changes
%load_ext autoreload
%reload_ext autoreload
%autoreload 2

import sqlite3
import pandas as pd
import datetime as dt
from datetime import datetime, timezone
from pandas.io.formats import style

import common as comm

local_timezone = 'Asia/Singapore'
ordered_weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# (label, start, end) per block; end times run to :59:59 so the last minute of each block is covered
three_hourly_blocks = [
    ('00-03', '00:00:00', '02:59:59'),
    ('03-06', '03:00:00', '05:59:59'),
    ('06-09', '06:00:00', '08:59:59'),
    ('09-12', '09:00:00', '11:59:59'),
    ('12-15', '12:00:00', '14:59:59'),
    ('15-18', '15:00:00', '17:59:59'),
    ('18-21', '18:00:00', '20:59:59'),
    ('21-00', '21:00:00', '23:59:59'),
]




In [2]:
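# read the commit history from the local SQLite database
conn = sqlite3.connect("github_api.db")
cursor = conn.cursor()
cursor.execute("""
    SELECT *
    FROM commits_history
""")

# column names are assigned by position, so they must match the table's column order
df = pd.DataFrame(cursor.fetchall(), columns=['committer_name', 'committer_email', 'commit_datetime', 'commit_url'])
conn.close()
print(f"df.shape: {df.shape}")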

df.shape: (14, 4)


### Issue 1 - Determine the top 5 committers ranked by commit count, along with each committer's count

In [3]:
issue_1_df = df.copy()
(issue_1_df[['committer_name', 'commit_datetime']]
    .groupby(['committer_name']).count()
    .sort_values(['commit_datetime'], ascending=False)
    .head())

| committer_name | commit_datetime |
|---|---|
| GitHub | 13 |
| Jarek Potiuk | 1 |


### Issue 2 - Determine the committer with the longest streak

In [4]:
issue_2_df = comm.localize_timestamp_to_local_timezone(df.copy(), 'commit_datetime', local_timezone)

In [5]:
# pivot commits by committer and local date to spot possible streaks at a glance
issue_2_df_pv_on_name_date = issue_2_df.pivot_table(index='committer_name', columns='locale_date')

issue_2_df_pv_on_name_date


Values are `utc_dt_local_tz`; rows are `committer_name`, columns are `locale_date`.

| committer_name | 03/01/22 | 03/22/22 | 04/09/22 | 04/13/22 | 04/26/22 | 05/17/22 | 05/24/22 | 06/07/22 | 06/24/22 | 07/05/22 | 07/11/22 | 08/04/22 | 08/18/22 | 08/27/22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GitHub | 2022-03-01 22:28:28+08:00 | 2022-03-22 04:38:25+08:00 | 2022-04-09 01:20:54+08:00 | 2022-04-13 03:15:29+08:00 | NaT | 2022-05-17 05:13:53+08:00 | 2022-05-24 22:10:18+08:00 | 2022-06-07 19:13:16+08:00 | 2022-06-24 23:51:01+08:00 | 2022-07-05 18:43:43+08:00 | 2022-07-11 23:52:23+08:00 | 2022-08-04 23:05:52+08:00 | 2022-08-18 23:13:21+08:00 | 2022-08-27 10:02:52+08:00 |
| Jarek Potiuk | NaT | NaT | NaT | NaT | 2022-04-26 05:05:00+08:00 | NaT | NaT | NaT | NaT | NaT | NaT | NaT | NaT | NaT |


There is no commit streak in this commit history for the Apache Airflow repository from March to August 2022: no two commits fall on consecutive days.

The `seek_consecutive_dates()` function groups the dates it is given into runs of consecutive days. Here, each range lists only its start date, and because no two dates are consecutive, all 14 dates come back as separate ranges.

In [6]:
comm.seek_consecutive_dates(issue_2_df['utc_dt_isoformat'])

{'Range1': ('2022-03-01',),
 'Range2': ('2022-03-22',),
 'Range3': ('2022-04-09',),
 'Range4': ('2022-04-13',),
 'Range5': ('2022-04-26',),
 'Range6': ('2022-05-17',),
 'Range7': ('2022-05-24',),
 'Range8': ('2022-06-07',),
 'Range9': ('2022-06-24',),
 'Range10': ('2022-07-05',),
 'Range11': ('2022-07-11',),
 'Range12': ('2022-08-04',),
 'Range13': ('2022-08-18',),
 'Range14': ('2022-08-27',)}
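
The implementation of `comm.seek_consecutive_dates()` is not shown in this notebook. As an illustration of the idea only, here is a minimal pandas sketch that computes the longest run of consecutive local calendar days for each committer. The function name `longest_streak_per_committer`, its parameters, and the sample data are assumptions made for this example and are not taken from the `common` module.

```python
import pandas as pd


def longest_streak_per_committer(df, name_col="committer_name",
                                 ts_col="commit_datetime",
                                 tz="Asia/Singapore"):
    """Longest run of consecutive local calendar days per committer (sketch)."""
    # Parse timestamps as UTC, convert to the local timezone, then keep
    # only the local calendar day (timezone info dropped, time set to midnight).
    local = pd.to_datetime(df[ts_col], utc=True).dt.tz_convert(tz)
    local_day = local.dt.tz_localize(None).dt.normalize()

    days = (
        pd.DataFrame({"name": df[name_col].to_numpy(),
                      "day": local_day.to_numpy()})
        .drop_duplicates()              # several commits on one day count once
        .sort_values(["name", "day"])
    )

    # A new run starts whenever the gap to the committer's previous day
    # is anything other than exactly one day (including the first day).
    gap = days.groupby("name")["day"].diff()
    run_id = (gap != pd.Timedelta(days=1)).cumsum().rename("run")

    run_lengths = days.groupby(["name", run_id]).size()
    return run_lengths.groupby(level="name").max().sort_values(ascending=False)


# Made-up sample: committer A has commits on 1, 2, 3 and 10 March (local time).
sample = pd.DataFrame({
    "committer_name": ["A", "A", "A", "B", "A"],
    "commit_datetime": [
        "2022-03-01T01:00:00Z", "2022-03-02T01:00:00Z", "2022-03-03T01:00:00Z",
        "2022-03-01T01:00:00Z", "2022-03-10T01:00:00Z",
    ],
})
streaks = longest_streak_per_committer(sample)
print(streaks)
```

Computing days in the local timezone matters: two commits on the same UTC day can fall on different local days, and the reverse.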

### Issue 3 - Generate a heatmap of commit counts across all users by day of week and 3-hour block

In [7]:
(comm.df_for_heatmap(df, local_timezone)
    .pivot_table(index='day_of_week', columns='time_block', values=['utc_dt_local_tz'], aggfunc='count')
    .style.background_gradient(axis=None))

Values are counts of `utc_dt_local_tz`; rows are `day_of_week`, columns are `time_block`.

| day_of_week | 00-03 | 03-06 | 09-12 | 18-21 | 21-00 |
|---|---|---|---|---|---|
| Mon | 0 | 0 | 0 | 0 | 1 |
| Tue | 0 | 3 | 0 | 2 | 2 |
| Wed | 0 | 1 | 0 | 0 | 0 |
| Thu | 0 | 0 | 0 | 0 | 2 |
| Fri | 0 | 0 | 0 | 0 | 1 |
| Sat | 1 | 0 | 1 | 0 | 0 |
| Sun | 0 | 0 | 0 | 0 | 0 |
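
The heatmap above only shows time blocks that contain at least one commit, and the bucketing itself happens inside `comm.df_for_heatmap()`, which is not shown here. The sketch below is one possible way to derive the day-of-week and 3-hour-block buckets and to keep empty blocks visible. The function `commit_counts_by_day_and_block` and the constant names are assumptions made for this example; the three sample timestamps are taken from the pivot output in Issue 2.

```python
import pandas as pd

ORDERED_WEEKDAY = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
BLOCK_LABELS = ["00-03", "03-06", "06-09", "09-12",
                "12-15", "15-18", "18-21", "21-00"]


def commit_counts_by_day_and_block(timestamps, tz="Asia/Singapore"):
    """Count commits per (day of week, 3-hour block) in local time (sketch)."""
    local = pd.to_datetime(pd.Series(list(timestamps)), utc=True).dt.tz_convert(tz)

    # dayofweek is 0 for Monday; integer division of the hour by 3 gives
    # a block index from 0 (00:00-02:59) to 7 (21:00-23:59).
    day_labels = [ORDERED_WEEKDAY[i] for i in local.dt.dayofweek]
    block_labels = [BLOCK_LABELS[h // 3] for h in local.dt.hour]

    counts = pd.crosstab(
        pd.Series(day_labels, name="day_of_week"),
        pd.Series(block_labels, name="time_block"),
    )
    # Reindex so that days and blocks without commits appear as zeros,
    # in calendar and clock order.
    return counts.reindex(index=ORDERED_WEEKDAY, columns=BLOCK_LABELS, fill_value=0)


sample_timestamps = [
    "2022-03-01 22:28:28+08:00",
    "2022-03-22 04:38:25+08:00",
    "2022-08-27 10:02:52+08:00",
]
counts = commit_counts_by_day_and_block(sample_timestamps)
print(counts)
```

The resulting frame can be styled the same way as in the notebook, with `counts.style.background_gradient(axis=None)`, which requires matplotlib to be installed.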
