# List of Commits

Test assignment: https://docs.google.com/document/d/1hnBYR3x798YU5ewLpk8Kz5pEmDnCbQ2wd-lM1ff9wgg/edit#

## 1. Get raw commits

### 1.1. Clone [huggingface/transformers](https://github.com/huggingface/transformers)

In [1]:
!git clone https://github.com/huggingface/transformers.git

### 1.2. Enter repository folder

In [2]:
cd transformers

/home/bernie/Git/List-of-Commits/transformers


### 1.3. Create files with commit information

At this point the full commit history is available locally. To simplify parsing, I split the commit information into two files: 'commit_info.csv', which holds each commit's hash (the 'commit_#' column), number of changed files, number of insertions and number of deletions, and 'commit_message.txt', which holds the commit messages.

#### Creating 'commit_info.csv'

In [3]:
!echo 'commit_#,changed_files,diff_+,diff_-' > ../commit_info.csv

In [4]:
!git log --shortstat --pretty='@%H,' | tr "\n" " " | tr "@" "\n" >> ../commit_info.csv

In [5]:
!head -10 ../commit_info.csv

commit_#,changed_files,diff_+,diff_-

699e90437f984d69ad3c9b891dd2e9d0fc2cffe4,   1 file changed, 1 insertion(+), 1 deletion(-) 
c54646b13d468b7a21fd6ee18f943ad69daab48e,   3 files changed, 168 insertions(+), 1 deletion(-) 
cc3d0e1b017dbb8dcbba1eb01be77aef7bacee1a,   22 files changed, 1682 insertions(+), 6 deletions(-) 
3a9476d1b412274bcc51143acaaee187e9d18120,   1 file changed, 8 insertions(+), 7 deletions(-) 
60d1f31bb009d09e884699bfe30ac34555bb4a5c,   47 files changed, 89 insertions(+), 89 deletions(-) 
5011efbec81a7a1d094a2eda8bde2b74613ca8b8,   1 file changed, 3 insertions(+), 2 deletions(-) 
504ae9181ca3f0918033f098d10a2c63153e26a6,   1 file changed, 4 insertions(+), 4 deletions(-) 
6cb7d6ec36ec5b3f97f76d8e243cf539fec78949,   1 file changed, 6 insertions(+), 1 deletion(-) 
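
The CSV above assigns values to columns by position, which works for the rows shown because each of them lists files, insertions and deletions. Git omits the insertion or deletion part when its count is zero, so a deletion-only commit would place its deletion count in the 'diff_+' column. A minimal sketch of keyword-based parsing follows; `parse_shortstat` is a hypothetical helper, not part of the notebook, and the deletion-only summary uses made-up numbers.

```python
import re

# Hypothetical helper (not part of the original notebook): parse one
# `git log --shortstat` summary by keyword instead of by position.
_PATTERNS = {
    "files": re.compile(r"(\d+) files? changed"),
    "insertions": re.compile(r"(\d+) insertions?\(\+\)"),
    "deletions": re.compile(r"(\d+) deletions?\(-\)"),
}


def parse_shortstat(summary):
    """Return a dict of counts; parts missing from the summary count as 0."""
    counts = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(summary)
        counts[name] = int(match.group(1)) if match else 0
    return counts


# Summary taken from the output above.
print(parse_shortstat("3 files changed, 168 insertions(+), 1 deletion(-)"))
# Deletion-only summary (made-up numbers): the count stays under "deletions".
print(parse_shortstat("1 file changed, 3 deletions(-)"))
```

Because every count is looked up by its label, an empty summary (for example from a commit that reports no statistics) simply yields zeros.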


#### Creating 'commit_message.txt'

In [6]:
!git log --pretty='%s' > ../commit_message.txt

In [7]:
!head -10 ../commit_message.txt

flan-t5.mdx: fix link to large model (#20555)
Add ESM contact prediction (#20535)
[New Model] Add TimeSformer model (#18908)
fix cuda OOM by using single Prior (#20486)
v4.26.0.dev0
Fix link in pipeline device map (#20517)
Fix Hubert models in TFHubertModel and TFHubertForCTC documentation code (#20516)
Fix doctest (#20534)
QnA example: add speed metric (#20522)
update post_process_image_guided_detection (#20521)


### 1.4. Exit repository folder

In [8]:
cd ../

/home/bernie/Git/List-of-Commits


## 2. Parse raw commits

In [9]:
import pandas as pd
import numpy as np

### 2.1. Read 'commit_info.csv' into DataFrame

In [10]:
df = pd.read_csv("commit_info.csv")

In [11]:
df.head()

Unnamed: 0,commit_#,changed_files,diff_+,diff_-
0,699e90437f984d69ad3c9b891dd2e9d0fc2cffe4,1 file changed,1 insertion(+),1 deletion(-)
1,c54646b13d468b7a21fd6ee18f943ad69daab48e,3 files changed,168 insertions(+),1 deletion(-)
2,cc3d0e1b017dbb8dcbba1eb01be77aef7bacee1a,22 files changed,1682 insertions(+),6 deletions(-)
3,3a9476d1b412274bcc51143acaaee187e9d18120,1 file changed,8 insertions(+),7 deletions(-)
4,60d1f31bb009d09e884699bfe30ac34555bb4a5c,47 files changed,89 insertions(+),89 deletions(-)


The following processing is applied here:
- the unnecessary 'changed_files' column is dropped
- NaN fields are replaced with 0
- the numbers in the 'diff_+' and 'diff_-' columns are extracted from the strings

In [12]:
df.drop("changed_files", axis=1, inplace=True)
df.fillna(0, inplace=True)

In [13]:
df.head()

Unnamed: 0,commit_#,diff_+,diff_-
0,699e90437f984d69ad3c9b891dd2e9d0fc2cffe4,1 insertion(+),1 deletion(-)
1,c54646b13d468b7a21fd6ee18f943ad69daab48e,168 insertions(+),1 deletion(-)
2,cc3d0e1b017dbb8dcbba1eb01be77aef7bacee1a,1682 insertions(+),6 deletions(-)
3,3a9476d1b412274bcc51143acaaee187e9d18120,8 insertions(+),7 deletions(-)
4,60d1f31bb009d09e884699bfe30ac34555bb4a5c,89 insertions(+),89 deletions(-)


In [14]:
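# Extract the leading number; rows without a match (such as the 0 placeholders) become NaN, so fill with 0 and cast to int.
df['diff_+'] = pd.to_numeric(df['diff_+'].str.extract(r'(\d+)', expand=False)).fillna(0).astype(int)
df['diff_-'] = pd.to_numeric(df['diff_-'].str.extract(r'(\d+)', expand=False)).fillna(0).astype(int)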

In [15]:
df.head()

Unnamed: 0,commit_#,diff_+,diff_-
0,699e90437f984d69ad3c9b891dd2e9d0fc2cffe4,1,1
1,c54646b13d468b7a21fd6ee18f943ad69daab48e,168,1
2,cc3d0e1b017dbb8dcbba1eb01be77aef7bacee1a,1682,6
3,3a9476d1b412274bcc51143acaaee187e9d18120,8,7
4,60d1f31bb009d09e884699bfe30ac34555bb4a5c,89,89
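
The cleanup steps in this section can be tried on a small hand-made frame before running them on the full history. The sketch below is only an illustration: the hashes and counts are made up, and the name `toy` is not used anywhere in the notebook. It extracts the digits first and fills missing values afterwards, so that the resulting columns hold plain integers.

```python
import numpy as np
import pandas as pd

# Toy data in the same shape as commit_info.csv (hashes and numbers are made up).
toy = pd.DataFrame(
    {
        "commit_#": ["aaa", "bbb", "ccc"],
        "changed_files": ["1 file changed", "2 files changed", "1 file changed"],
        "diff_+": ["1 insertion(+)", "168 insertions(+)", "5 insertions(+)"],
        "diff_-": ["1 deletion(-)", "7 deletions(-)", np.nan],
    }
)

toy = toy.drop(columns="changed_files")
for col in ["diff_+", "diff_-"]:
    # Extract the digits first; rows without a match stay NaN ...
    numbers = toy[col].str.extract(r"(\d+)", expand=False)
    # ... then convert, fill the gaps with 0 and store plain integers.
    toy[col] = pd.to_numeric(numbers).fillna(0).astype(int)

print(toy)
```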


### 2.2. Read 'commit_message.txt' and add a 'commit_message' column to the DataFrame

In [16]:
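# Read one commit subject per line; the context manager closes the file afterwards.
with open("commit_message.txt", "r", encoding="utf-8") as f:
    messages = [line.strip() for line in f]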

In [17]:
df.insert(1, 'commit_message', messages)

In [18]:
df.head()

Unnamed: 0,commit_#,commit_message,diff_+,diff_-
0,699e90437f984d69ad3c9b891dd2e9d0fc2cffe4,flan-t5.mdx: fix link to large model (#20555),1,1
1,c54646b13d468b7a21fd6ee18f943ad69daab48e,Add ESM contact prediction (#20535),168,1
2,cc3d0e1b017dbb8dcbba1eb01be77aef7bacee1a,[New Model] Add TimeSformer model (#18908),1682,6
3,3a9476d1b412274bcc51143acaaee187e9d18120,fix cuda OOM by using single Prior (#20486),8,7
4,60d1f31bb009d09e884699bfe30ac34555bb4a5c,v4.26.0.dev0,89,89
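
Inserting the messages as a column relies on both files listing the same commits in the same order. One possible alternative, shown here only as a sketch, is to join on the commit hash. It assumes a hypothetical file written with `git log --pretty='%H%x09%s'` (hash, tab, subject), which the notebook does not create; `raw_lines`, `messages_df` and `stats` are illustrative names.

```python
import pandas as pd

# Assumed content of a hypothetical file written with
#   git log --pretty='%H%x09%s'
# i.e. one "<hash><TAB><subject>" line per commit (lines taken from the outputs above).
raw_lines = [
    "699e90437f984d69ad3c9b891dd2e9d0fc2cffe4\tflan-t5.mdx: fix link to large model (#20555)",
    "c54646b13d468b7a21fd6ee18f943ad69daab48e\tAdd ESM contact prediction (#20535)",
]

# Split on the first tab only, so tabs inside a subject would survive.
pairs = [line.split("\t", 1) for line in raw_lines]
messages_df = pd.DataFrame(pairs, columns=["commit_#", "commit_message"])

# Stand-in for the parsed statistics, deliberately in a different row order.
stats = pd.DataFrame(
    {
        "commit_#": [
            "c54646b13d468b7a21fd6ee18f943ad69daab48e",
            "699e90437f984d69ad3c9b891dd2e9d0fc2cffe4",
        ],
        "diff_+": [168, 1],
        "diff_-": [1, 1],
    }
)

# validate="one_to_one" makes pandas raise if a hash is duplicated on either side.
merged = stats.merge(messages_df, on="commit_#", how="left", validate="one_to_one")
print(merged)
```

With a key-based join, each message ends up next to its own hash even when the two inputs are ordered differently.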


### 2.3. Save the resulting DataFrame

In [19]:
df.to_csv("list_of_commits.csv")
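
By default `to_csv` also writes the row index as an unnamed first column, which matters when the file is read back. The round trip below is a small sketch: it uses an in-memory buffer in place of 'list_of_commits.csv' and a one-row stand-in frame named `result`, both of which are assumptions made for illustration.

```python
import io

import pandas as pd

# Small stand-in for the final DataFrame (values taken from the output above).
result = pd.DataFrame(
    {
        "commit_#": ["699e90437f984d69ad3c9b891dd2e9d0fc2cffe4"],
        "commit_message": ["flan-t5.mdx: fix link to large model (#20555)"],
        "diff_+": [1],
        "diff_-": [1],
    }
)

# An in-memory buffer stands in for list_of_commits.csv.
buffer = io.StringIO()
result.to_csv(buffer)  # the row index is written as an unnamed first column

buffer.seek(0)
plain = pd.read_csv(buffer)  # the index comes back as an "Unnamed: 0" column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the first column is used as the index

print(list(plain.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is the other common option when the row numbers carry no meaning.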