# Preprocessing Commit Data for Role Classification

This notebook does the complete preprocessing for the GDSC commit dataset. The goal is to take the raw data from `final_dataset.csv` and transform it into a clean, structured format. 

Here's what I did:
1.  **Load and Inspect:** Load the data and get a look at its structure.
2.  **Clean the Data:** Clean the `commitmessage` text by making it lowercase, removing punctuation and digits, and taking out common stop words.
3.  **Engineer Features:** Create new, meaningful features from the existing columns. This includes extracting time-based data, calculating commit metrics, and cleaning up the file extensions list.
4.  **Save the Results:** Save the processed data into a new CSV file, which will be the input for the modeling notebook.

### 1. Setup and Imports:

Import the libraries for data manipulation and text processing.

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
import warnings

warnings.filterwarnings('ignore')

In [3]:
#Download the list of stopwords from NLTK.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Xeron\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

### 2. Load and Inspect Data:

Load the dataset from the CSV file and look at its contents and structure to understand it.

In [4]:
def loadData(filePath):
    return pd.read_csv(filePath)

filePath = 'final_dataset.csv'
commitData = loadData(filePath)

print("Dataset Head:")
display(commitData.head())

print("\nDataset Info:")
commitData.info()

Dataset Head:


Unnamed: 0,index,role,committype,fileextensions,numfileschanged,linesadded,linesdeleted,numcommentsadded,timeofcommit,commitmessage
0,0,frontend,feature,[np.str_('js_ts')],4,312,100,2,Friday 17:00,"""Implement responsive UI component with dropdo..."
1,1,frontend,feature,[np.str_('css')],4,191,74,2,Friday 20:00,"""Refactor UI components: Implement responsive ..."
2,2,fullstack,feature,[np.str_('html')],4,275,146,4,Thursday 21:00,"""feat: Implement responsive UI layout with mod..."
3,3,frontend,refactor,[np.str_('js_ts')],4,245,164,2,Thursday 18:00,"""Refactored UI components for responsive layou..."
4,4,fullstack,feature,"[np.str_('js_ts'), np.str_('html')]",2,692,378,5,Sunday 20:00,"""feat: Implement responsive UI layout for logi..."



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   index             1500 non-null   int64 
 1   role              1500 non-null   object
 2   committype        1500 non-null   object
 3   fileextensions    1500 non-null   object
 4   numfileschanged   1500 non-null   int64 
 5   linesadded        1500 non-null   int64 
 6   linesdeleted      1500 non-null   int64 
 7   numcommentsadded  1500 non-null   int64 
 8   timeofcommit      1500 non-null   object
 9   commitmessage     1500 non-null   object
dtypes: int64(5), object(5)
memory usage: 117.3+ KB


### 3. Data Cleaning:

The `commitmessage` column is raw text. To make it useful for a model, I need to clean it up. This function will standardize the text by converting it to lowercase, removing punctuation and numbers, and filtering out common English stop words that don't add much meaning.

In [5]:
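# Build the stop-word set once instead of rebuilding it for every message.
englishStopWords = set(stopwords.words('english'))

def cleanCommitMessage(message):
    # Lowercase, then drop every character that is not a letter or whitespace.
    message = message.lower()
    message = re.sub(r'[^a-zA-Z\s]', '', message)
    # Keep only the words that are not English stop words.
    cleanedWords = [word for word in message.split() if word not in englishStopWords]
    return ' '.join(cleanedWords)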

commitData['cleanedMessage'] = commitData['commitmessage'].apply(cleanCommitMessage)

print("Cleaned commit messages created.")
display(commitData[['commitmessage', 'cleanedMessage']].head())

Cleaned commit messages created.


Unnamed: 0,commitmessage,cleanedMessage
0,"""Implement responsive UI component with dropdo...",implement responsive ui component dropdown mod...
1,"""Refactor UI components: Implement responsive ...",refactor ui components implement responsive th...
2,"""feat: Implement responsive UI layout with mod...",feat implement responsive ui layout modal drop...
3,"""Refactored UI components for responsive layou...",refactored ui components responsive layout css...
4,"""feat: Implement responsive UI layout for logi...",feat implement responsive ui layout login page...
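
To make the cleaning steps concrete, here is a minimal sketch that traces one message through them. The tiny stop-word set and the sample message are made up for illustration (the notebook itself uses NLTK's full English list), so the exact words removed will differ on real data.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list, kept tiny so the
# example runs without downloading any corpus.
sampleStopWords = {"with", "for", "the", "and", "a"}

def cleanMessageSketch(message, stopWords=sampleStopWords):
    """Lowercase, strip everything except letters/whitespace, drop stop words."""
    message = message.lower()
    message = re.sub(r"[^a-z\s]", "", message)
    return " ".join(word for word in message.split() if word not in stopWords)

sample = "feat: Implement drop-down for the login page (#42)"
cleaned = cleanMessageSketch(sample)
print(cleaned)  # feat implement dropdown login page
```

Note that punctuation is deleted rather than replaced by a space, so `drop-down` becomes the single token `dropdown`, and `(#42)` disappears entirely.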


### 4. Feature Engineering:

Create new features from the existing data. These features are designed to provide the model with more direct signals about the nature of a commit.

#### 4.1 Adding Features:

Add features for the message length, the net change in code lines, and the average lines added per file. These metrics can provide clues about the type and scope of the work done.

In [6]:
def createFeatures(df):
    newDf = df.copy()
    newDf['messageLengthWords'] = newDf['cleanedMessage'].str.split().str.len()
    newDf['netCodeChange'] = newDf['linesadded'] - newDf['linesdeleted']
    newDf['linesAddedPerFile'] = (newDf['linesadded'] / newDf['numfileschanged']).fillna(0)
    newDf['linesAddedPerFile'] = newDf['linesAddedPerFile'].replace([np.inf, -np.inf], 0)
    return newDf

commitData = createFeatures(commitData)

print("Common sense features created:")
display(commitData[['messageLengthWords', 'netCodeChange', 'linesAddedPerFile']].head())

Common sense features created:


Unnamed: 0,messageLengthWords,netCodeChange,linesAddedPerFile
0,17,212,78.0
1,18,117,47.75
2,16,129,68.75
3,21,81,61.25
4,28,314,346.0
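
As a quick sanity check on the arithmetic, the sketch below recomputes two of the metrics for the first row of the dataset in plain Python. The helper name `commitMetrics` is hypothetical, and the explicit zero check plays the role that `fillna(0)` and the `inf` replacement play in `createFeatures` (in pandas, dividing by zero files yields `NaN` or `inf` rather than raising).

```python
def commitMetrics(linesAdded, linesDeleted, numFilesChanged):
    """Return (netCodeChange, linesAddedPerFile) for a single commit."""
    netCodeChange = linesAdded - linesDeleted
    # Guard against commits that report zero changed files.
    linesAddedPerFile = linesAdded / numFilesChanged if numFilesChanged else 0.0
    return netCodeChange, linesAddedPerFile

firstRow = commitMetrics(312, 100, 4)   # values from row 0 of the dataset
emptyCommit = commitMetrics(0, 0, 0)    # made-up edge case with no files

print(firstRow)     # (212, 78.0)
print(emptyCommit)  # (0, 0.0)
```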


#### 4.2 Time-Based Features:

The `timeofcommit` column tells me when a commit was made. Extracting the day of the week and the hour can reveal patterns in work schedules that might differ between roles.

In [7]:
def extractTimeFeatures(df):
    newDf = df.copy()
    timeParts = newDf['timeofcommit'].str.split(' ', expand=True)
    newDf['dayOfWeek'] = timeParts[0]
    newDf['hourOfDay'] = timeParts[1].str.split(':').str[0].astype(int)
    return newDf

commitData = extractTimeFeatures(commitData)

print("Time based features created:")
display(commitData[['timeofcommit', 'dayOfWeek', 'hourOfDay']].head())

Time based features created:


Unnamed: 0,timeofcommit,dayOfWeek,hourOfDay
0,Friday 17:00,Friday,17
1,Friday 20:00,Friday,20
2,Thursday 21:00,Thursday,21
3,Thursday 18:00,Thursday,18
4,Sunday 20:00,Sunday,20
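
The split logic assumes every value has the form `<Day> <HH>:<MM>`. Here is a small sketch of the same parsing for a single value, with an added range check on the hour. The helper name `parseCommitTime` and the malformed input are assumptions for illustration, not part of the dataset.

```python
def parseCommitTime(timeOfCommit):
    """Split a value like 'Friday 17:00' into (dayOfWeek, hourOfDay)."""
    dayOfWeek, clock = timeOfCommit.split(" ")
    hourOfDay = int(clock.split(":")[0])
    if not 0 <= hourOfDay <= 23:
        raise ValueError(f"hour out of range in {timeOfCommit!r}")
    return dayOfWeek, hourOfDay

parsed = parseCommitTime("Friday 17:00")
print(parsed)  # ('Friday', 17)

# A value without a time part fails loudly instead of producing a bad row.
try:
    parseCommitTime("Friday")
    malformedRejected = False
except ValueError:
    malformedRejected = True
print(malformedRejected)  # True
```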


#### 4.3 File Extension Features:

The `fileextensions` column is stored as a string that looks like a list, with each entry wrapped in `np.str_(...)`. Parse this string to pull out the quoted values and get a clean list of file extensions for each commit. File extensions are likely to be a strong indicator of a developer's role (like `.css` for frontend, `.sql` for backend).

In [8]:
def processFileExtensions(extensionsString):
    try:
        extensions = re.findall(r"'([^']*)'", extensionsString)
        return extensions
    except (TypeError, AttributeError):
        return []

commitData['processedFileExtensions'] = commitData['fileextensions'].apply(processFileExtensions)

print("File extensions cleaned up:")
display(commitData[['fileextensions', 'processedFileExtensions']].head())

File extensions cleaned up:


Unnamed: 0,fileextensions,processedFileExtensions
0,[np.str_('js_ts')],[js_ts]
1,[np.str_('css')],[css]
2,[np.str_('html')],[html]
3,[np.str_('js_ts')],[js_ts]
4,"[np.str_('js_ts'), np.str_('html')]","[js_ts, html]"
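
The regular expression captures whatever sits between each pair of single quotes, which is why the `np.str_(...)` wrappers drop out. A minimal standalone sketch of the same idea follows. It swaps the `try`/`except` for an explicit type check, which is one possible variation, and the helper name `parseExtensions` is hypothetical.

```python
import re

def parseExtensions(extensionsString):
    """Return the single-quoted values found in a list-like string."""
    # Non-string input (for example a missing value) yields an empty list.
    if not isinstance(extensionsString, str):
        return []
    return re.findall(r"'([^']*)'", extensionsString)

multiple = parseExtensions("[np.str_('js_ts'), np.str_('html')]")
missing = parseExtensions(None)

print(multiple)  # ['js_ts', 'html']
print(missing)   # []
```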


### 5. Final Review and Save:

Take a look at the final DataFrame. Select only the columns I need for modeling and drop the original, unprocessed ones. Then, save the result to a new file.

In [9]:
finalColumns = [
    'role',
    'committype',
    'numfileschanged',
    'linesadded',
    'linesdeleted',
    'numcommentsadded',
    'cleanedMessage',
    'messageLengthWords',
    'netCodeChange',
    'linesAddedPerFile',
    'dayOfWeek',
    'hourOfDay',
    'processedFileExtensions'
]

processedData = commitData[finalColumns]

print("Final preprocessed DataFrame head:")
display(processedData.head())

Final preprocessed DataFrame head:


Unnamed: 0,role,committype,numfileschanged,linesadded,linesdeleted,numcommentsadded,cleanedMessage,messageLengthWords,netCodeChange,linesAddedPerFile,dayOfWeek,hourOfDay,processedFileExtensions
0,frontend,feature,4,312,100,2,implement responsive ui component dropdown mod...,17,212,78.0,Friday,17,[js_ts]
1,frontend,feature,4,191,74,2,refactor ui components implement responsive th...,18,117,47.75,Friday,20,[css]
2,fullstack,feature,4,275,146,4,feat implement responsive ui layout modal drop...,16,129,68.75,Thursday,21,[html]
3,frontend,refactor,4,245,164,2,refactored ui components responsive layout css...,21,81,61.25,Thursday,18,[js_ts]
4,fullstack,feature,2,692,378,5,feat implement responsive ui layout login page...,28,314,346.0,Sunday,20,"[js_ts, html]"


In [11]:
outputFilePath = 'processed_dataset.csv'
processedData.to_csv(outputFilePath, index=False)

print(f"Preprocessing complete. Data saved to {outputFilePath}")

Preprocessing complete. Data saved to processed_dataset.csv
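
One thing to keep in mind for the modeling notebook: CSV has no list type, so each `processedFileExtensions` value is written as the text of the list and comes back as a plain string when the file is read. The sketch below shows the round trip using only the standard library. The in-memory buffer and the single row are made up for illustration.

```python
import ast
import csv
import io

# Write one illustrative row; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["role", "processedFileExtensions"])
writer.writerow(["fullstack", str(["js_ts", "html"])])

# Read it back: the cell is now text, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
rawValue = rows[0]["processedFileExtensions"]

# ast.literal_eval safely turns the text back into a Python list.
restored = ast.literal_eval(rawValue)

print(type(rawValue).__name__, rawValue)   # str ['js_ts', 'html']
print(type(restored).__name__, restored)   # list ['js_ts', 'html']
```

When loading with pandas, passing `converters={'processedFileExtensions': ast.literal_eval}` to `pd.read_csv` should achieve the same restoration.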
