<span style="color:orange;font-size:20pt;font-weight:bold">CS412 - Machine Learning - Term Project Notebook - "MLCoders"</span>

<span style="font-size:14pt;">This Jupyter notebook contains the code for importing, cleaning, and vectorizing the dataset and for training the model, along with the EDA used to gather results and give clear recommendations.</span>

<span style="color:red;font-weight:bold">Required Libraries</span>

In [1]:
import pandas as pd 
import numpy as np
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

For text normalization, we relied on NLTK (Natural Language Toolkit), a widely used Python library for working with human language data. Its stopwords and WordNet corpora let us remove uninformative words and lemmatize the remaining ones before classification.

In [2]:
# Here, we downloaded the NLTK data.
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/beyzabalota/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/beyzabalota/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

After the download, we loaded our dataset and initialized the WordNet lemmatizer from NLTK.

In [3]:
train_file_path = 'bugs-train.csv'  # Loaded the dataset.
data = pd.read_csv(train_file_path)

lemmatizer = WordNetLemmatizer()
display(data)

if 'severity' in data.columns:
    print("\nCount of Each Unique Value in 'severity' Column:")
    print(data['severity'].value_counts())
    
print(data.shape)

Unnamed: 0,bug_id,summary,severity
0,365569,Remove workaround from bug 297227,normal
1,365578,Print Preview crashes on any URL in gtk2 builds,critical
2,365582,Lines are not showing in table,major
3,365584,Firefox render ÛÏsimplified ArabicÛ font fa...,normal
4,365597,Crash [@ nsINodeInfo::NodeInfoManager],critical
...,...,...,...
159993,1143381,block elements with height after float left or...,normal
159994,1143392,typing in google translate will send reset inp...,normal
159995,1143394,[gstreamer] Nightly instantly crashes on Youtu...,critical
159996,1143395,Right click on Flash object with accessibility...,critical



Count of Each Unique Value in 'severity' Column:
severity
normal         125854
critical        18658
major            6053
enhancement      4426
minor            3102
trivial          1204
blocker           701
Name: count, dtype: int64
(159998, 3)
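
The counts above show a heavily imbalanced label distribution. As a rough reference point, the sketch below uses only those printed counts to compute each class share and the accuracy that a hypothetical baseline would reach by always predicting the most frequent class. It assumes that any held-out split follows the same distribution, which is an assumption and not something the notebook verifies.

```python
# Class counts copied from the value_counts() output above.
severity_counts = {
    "normal": 125854,
    "critical": 18658,
    "major": 6053,
    "enhancement": 4426,
    "minor": 3102,
    "trivial": 1204,
    "blocker": 701,
}

total = sum(severity_counts.values())
shares = {label: count / total for label, count in severity_counts.items()}

# A hypothetical baseline that always predicts the most frequent class
# is right exactly as often as that class occurs.
majority_label = max(severity_counts, key=severity_counts.get)
majority_baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label:<12}{share:.4f}")
print(f"Always predicting '{majority_label}': {majority_baseline_accuracy:.4f}")
```

Any accuracy reported later is easier to interpret against this baseline of roughly 0.79 than against zero.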


After getting an overview of our dataset, we wrote a function to preprocess the summary text.

In [4]:
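# Build the stop-word set once; calling stopwords.words('english') for every token is very slow.
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # replace non-word characters with spaces
    text = re.sub(r'\d', ' ', text)  # replace digits with spaces

    text = text.lower()
    tokens = text.split()
    # Drop stop words and lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    text = ' '.join(tokens)
    return text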
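
To make the effect of the regular expressions concrete, here is a minimal sketch of the same cleaning steps that runs without NLTK. The name `demo_clean` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins for illustration, and lemmatization is left out, so the output is not identical to what `preprocess_text` produces.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"on", "in", "any", "the", "from"}

def demo_clean(text):
    text = re.sub(r"\W", " ", text)  # non-word characters become spaces
    text = re.sub(r"\d", " ", text)  # digits become spaces
    tokens = text.lower().split()
    # Lemmatization is intentionally skipped in this sketch.
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

cleaned = demo_clean("Print Preview crashes on any URL in gtk2 builds")
print(cleaned)  # print preview crashes url gtk builds

# Underscores count as word characters for \W, so they survive the cleaning.
print(demo_clean("__proto__ leak 42"))  # __proto__ leak
```

The second call shows why tokens such as `__proto__` can still appear as feature names after vectorization: `\W` does not match the underscore.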


We then applied this function to the summary column and label-encoded the severity column, so that both the text and the target are in a form the model can learn from.

In [5]:
# Apply text preprocessing to the summary column
data['cleaned_summary'] = data['summary'].apply(preprocess_text)

# Label Encoding for the 'severity' column
label_encoder = LabelEncoder()
data['severity_encoded'] = label_encoder.fit_transform(data['severity'])
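
`LabelEncoder` assigns integer codes following the sorted order of the unique labels. The plain-Python sketch below mimics that behaviour for the seven severity values found in the dataset. It is an illustration of the resulting mapping, not the scikit-learn implementation, and the variable names are ours.

```python
# The seven severity values that appear in the training data.
severities = ["normal", "critical", "major", "enhancement", "minor", "trivial", "blocker"]

# Sorted unique labels, then one integer code per label (0, 1, 2, ...).
classes = sorted(set(severities))
label_to_code = {label: code for code, label in enumerate(classes)}
code_to_label = {code: label for label, code in label_to_code.items()}

for label, code in label_to_code.items():
    print(code, label)
```

With this ordering `blocker` maps to 0 and `normal` to 5, and the codes run from 0 to 6, so any setting that depends on the number of classes should use seven.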


To vectorize the data, we used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. It weights each word by how often it appears in a summary and down-weights words that appear in many summaries, which helps us identify the words that are more informative than others.
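
As a small worked example of the weighting, the sketch below computes TF-IDF by hand for three made-up cleaned summaries. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) and skips tokenization details. The documents and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical cleaned summaries, not rows from the dataset.
docs = ["crash print preview", "crash startup", "table line missing"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}  # L2-normalized weights

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term:<8}{weight:.4f}")
```

In the first document, `crash` gets a lower weight than `print` or `preview` because it also occurs in a second document.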

In [6]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000) # Used TF-IDF vectorization with maximum of 5000 most frequent words.
tfidf_features = tfidf_vectorizer.fit_transform(data['cleaned_summary'])

tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out()) # Converting features to dataframe

processed_data = pd.concat([tfidf_df, data['severity_encoded']], axis=1)


Next comes training. We split the vectorized training data into training and testing sets with an 80/20 ratio.

In [7]:
X = processed_data.drop('severity_encoded', axis=1)
y = processed_data['severity_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("First few rows of the training data:")
print(X_train.head())
print(y_train.head())


First few rows of the training data:
         __  _____  ________  __exposedprops__  __int  __proto__   _a  \
138894  0.0    0.0       0.0               0.0    0.0        0.0  0.0   
115279  0.0    0.0       0.0               0.0    0.0        0.0  0.0   
129928  0.0    0.0       0.0               0.0    0.0        0.0  0.0   
74987   0.0    0.0       0.0               0.0    0.0        0.0  0.0   
36702   0.0    0.0       0.0               0.0    0.0        0.0  0.0   

        _cairo_d  _demuxer  _max  ...  zombie  zone  zoom  zoomed  zooming  \
138894       0.0       0.0   0.0  ...     0.0   0.0   0.0     0.0      0.0   
115279       0.0       0.0   0.0  ...     0.0   0.0   0.0     0.0      0.0   
129928       0.0       0.0   0.0  ...     0.0   0.0   0.0     0.0      0.0   
74987        0.0       0.0   0.0  ...     0.0   0.0   0.0     0.0      0.0   
36702        0.0       0.0   0.0  ...     0.0   0.0   0.0     0.0      0.0   

        zwj  zwnj   û_   ûª  ûªt  
138894  0.0   0.0  0

After the split, we trained our selected model, XGBoost (Extreme Gradient Boosting), a gradient-boosted decision tree method that is well suited to supervised learning tasks such as this multi-class classification problem.
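
The training cell below uses the `multi:softmax` objective, where the model produces one raw score per class and predicts the class with the highest softmax probability. A minimal sketch of that final step follows. The scores are made up for illustration and the helper is not XGBoost code.

```python
import math

def softmax(scores):
    # Subtract the maximum score for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores, one per severity class.
raw_scores = [0.2, 1.5, -0.3, 0.0, -1.0, 2.1, -0.7]
probs = softmax(raw_scores)

# The predicted class is the index with the largest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(probs[predicted_class], 4))
```

The predicted index is an encoded label, which is why the notebook later maps predictions back to severity names with `label_encoder.inverse_transform`.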

In [8]:
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score

# We trained with the parameters below, which our grid search found to be the best.

xgb_params = {
    'objective': 'multi:softmax',
    'num_class': 7,  # seven severity classes appear in the training data
    'eval_metric': 'mlogloss',
    'learning_rate': 0.1, # Used this to prevent overfitting
    'max_depth': 6,  
    'n_estimators': 100,  
    'subsample': 0.8,  
    'colsample_bytree': 0.8, 
    'gamma': 0,  
    'min_child_weight': 1  
}

# Training part
xgb_model = xgb.XGBClassifier(**xgb_params)
xgb_model.fit(X_train, y_train)

# Made the predictions
y_pred = xgb_model.predict(X_test)

# The accuracy:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# And the macro precision
macro_precision = precision_score(y_test, y_pred, average='macro')
print(f"Macro Average Precision Score: {macro_precision}")

Accuracy: 0.85984375
Macro Average Precision Score: 0.8572595124084073
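
Macro precision averages the per-class precision with equal weight, so a rare class such as `blocker` counts as much as `normal`. The sketch below computes it by hand on a made-up example. The function is our own illustration, and it scores a class that is never predicted as 0, which matches scikit-learn's default behaviour (where it also raises a warning).

```python
def macro_precision(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        # True labels of the samples that were predicted as `label`.
        predicted_as_label = [t for t, p in zip(y_true, y_pred) if p == label]
        if not predicted_as_label:
            per_class.append(0.0)  # class never predicted: precision counted as 0
        else:
            correct = sum(t == label for t in predicted_as_label)
            per_class.append(correct / len(predicted_as_label))
    return sum(per_class) / len(per_class)

# Hypothetical labels: the model ignores the rare 'minor' class.
y_true = ["normal", "normal", "normal", "critical", "minor", "minor"]
y_pred = ["normal", "normal", "normal", "critical", "normal", "normal"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.4f}")
print(f"macro precision: {macro_precision(y_true, y_pred):.4f}")
```

In this toy case the accuracy is about 0.67 while the macro precision is about 0.53, because the class that is never predicted pulls the average down.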


After training the model, we run it on the test data and collect the predictions.

In [9]:
test_file_path = 'bugs-test.csv'  # Our test data
test_data = pd.read_csv(test_file_path)

test_data['cleaned_summary'] = test_data['summary'].apply(preprocess_text)
tfidf_features_test = tfidf_vectorizer.transform(test_data['cleaned_summary'])
tfidf_df_test = pd.DataFrame(tfidf_features_test.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


test_predictions = xgb_model.predict(tfidf_df_test)
predicted_severity = label_encoder.inverse_transform(test_predictions)
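
Note that the test summaries go through `transform`, not `fit_transform`, so they are mapped onto the vocabulary learned from the training data and the model sees the same columns in the same order. The sketch below illustrates the idea with plain word counts instead of TF-IDF weights. The documents and the `to_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned summaries.
train_docs = ["crash print preview", "crash startup"]
test_docs = ["crash shutdown"]

# "Fit": learn the vocabulary from the training documents only.
vocabulary = sorted({term for doc in train_docs for term in doc.split()})

def to_counts(doc):
    # "Transform": count only the terms that are in the learned vocabulary.
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

test_row = to_counts(test_docs[0])
print(vocabulary)
print(test_row)
```

The unseen word `shutdown` is simply dropped, and the test row has exactly one entry per training vocabulary term.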


To finish the prediction and model training part, we wrote the predictions to a new file, which we uploaded to the Kaggle competition.

In [10]:

submission_df = pd.DataFrame({ 'bug_id': test_data['bug_id'], 'severity': predicted_severity })

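# Use one variable for the path so the message always matches the file that was written.
submission_path = 'submissionvol5.csv'
submission_df.to_csv(submission_path, index=False)
print(f"Predictions saved to '{submission_path}'.")


Predictions saved to 'submissionvol5.csv'.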


This marks the end of the training part.