**Classification of Issues**

**Machine Learning Project on GitHub Issues**

GitHub Issue Classification


*   Predicting whether a GitHub issue is a bug, an enhancement, or a question.

In [None]:
## Data Collection
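# The /blob/ URL returns an HTML page; ?raw=true requests the CSV file itself
#!mkdir -p dataset && wget -O dataset/github-labels-top3-34k.csv "https://github.com/niranjana1997/Classification-of-Issues-on-Github/blob/main/github-labels-top3-34k.csv?raw=true"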

In [None]:
# Installing neattext library
!pip install neattext

Collecting neattext
  Downloading neattext-0.1.2-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.2


In [None]:
# Importing libraries
import pandas as pd # for data analysis
import neattext.functions as nfx # for text cleaning
import seaborn as sns

In [None]:
# Loading Dataset
df_csv = pd.read_csv('dataset/github-labels-top3-34k.csv', header=None, sep=',')


In [None]:
df_csv.head()

#### Dataset preparation
+ Extract the labels from the CSV file
    - enhancement|bug|question

In [None]:
df_csv = df_csv[0].str.split(r'(__label__enhancement)|(__label__bug)|(__label__question)', expand=True)
#df_csv
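
The column numbers used in the next cells follow from how a regex split with capture groups behaves. A minimal sketch using Python's `re.split` on a made-up row (the sample text is hypothetical, and this assumes pandas `str.split` mirrors `re.split` for regex patterns): each of the three groups gets its own slot, with `None` where a group did not match, and the remaining text lands in the last slot.

```python
import re

# Same pattern as the cell above: one capture group per label
LABEL_PATTERN = r'(__label__enhancement)|(__label__bug)|(__label__question)'

# Hypothetical row, for illustration only (not taken from the dataset)
sample_row = "__label__bug App crashes on startup"

parts = re.split(LABEL_PATTERN, sample_row)

# Slot 0: text before the label; slots 1-3: one per capture group
# (None when that group did not match); slot 4: text after the label
for index, part in enumerate(parts):
    print(index, repr(part))
```

With this layout, slot 2 holds the bug label and slot 4 the description, which is why the filtering step selects columns such as `[2,4]`.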

In [None]:
# Creating three data frames for each label
enhancement_df = df_csv[df_csv[1] == '__label__enhancement'][[1,4]]
bug_df = df_csv[df_csv[2] == '__label__bug'][[2,4]]
question_df = df_csv[df_csv[3] == '__label__question'][[3,4]]

In [None]:
# Adding column names to the dataframes
enhancement_df.columns = ['label','description']
bug_df.columns = ['label','description']
question_df.columns = ['label','description']

In [None]:
# Concat Dataframes
final_df = pd.concat([enhancement_df, bug_df, question_df])

In [None]:
# Removing prefix __label__
final_df['label'] = final_df['label'].str.replace('__label__', '', regex=False)

In [None]:
# Saving dataframe to csv file
final_df.to_csv("final_dataframe.csv")

In [None]:
final_df.head()

Class Distribution Analysis

In [None]:
final_df['label'].value_counts()

In [None]:
sns.countplot(x = 'label', data = final_df)

In [None]:
# Removing stopwords and converting the text into lower case
final_df['desc_clean'] = final_df['description'].apply(lambda x: nfx.remove_stopwords(str(x).lower()))

In [None]:
final_df.head()

Train-test split

In [None]:
# Data splitting
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(final_df['desc_clean'],final_df['label'],test_size=0.3,random_state=42)
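
If the class distribution turns out to be imbalanced, a plain random split can shift the label proportions between the training and test sets. One option, sketched here on made-up data rather than presented as this notebook's method, is the `stratify` argument of `train_test_split`, which keeps the proportions roughly equal in both sets.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data, for illustration only: 10 bugs, 6 enhancements, 4 questions
labels = ['bug'] * 10 + ['enhancement'] * 6 + ['question'] * 4
texts = [f"issue text {i}" for i in range(len(labels))]

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    texts, labels,
    test_size=0.3,
    random_state=42,
    stratify=labels,  # keep label proportions similar in both splits
)

print("train:", Counter(toy_y_train))
print("test: ", Counter(toy_y_test))
```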

**Model building**

**DecisionTreeClassifier - Naive Bayes - Logistic Regression - Random Forest**

These algorithms are used to build the issue classification model.

**Count Vectorizer**

CountVectorizer enables machine learning models to work with text. Machine learning models cannot use raw text directly; they operate on numbers.

CountVectorizer converts raw text into vectors of numbers: it builds a vocabulary from the training text and represents each document by how many times each vocabulary word occurs in it.
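
A small sketch of what this looks like in practice, using three made-up issue titles (not from the dataset). With default settings, CountVectorizer lowercases the text, keeps tokens of two or more characters, and orders the vocabulary alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up issue titles, for illustration only
docs = [
    "app crashes on startup",
    "add dark mode to app",
    "how to change startup mode",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Vocabulary words ordered by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row of the printed array is one document, and each column counts one vocabulary word.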

In [None]:
# Importing machine learning packages
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

The Pipeline class chains the stages used in building the model so that they run as a single estimator.

In [None]:
# Importing Pipeline package
from sklearn.pipeline import Pipeline

To automate model building with Pipeline, we declare each stage up front. Calling fit or predict on the pipeline then runs the stages in order.

We have two stages as follows:

1.   CountVectorizer converts the input text to vectors of numbers.
2.   A classifier (DecisionTreeClassifier, Naive Bayes, Logistic Regression, or Random Forest) is trained on those vectors.
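
The two stages can be sketched on a handful of made-up examples (the texts, labels, and `toy_` names below are hypothetical). Fitting the pipeline fits the vectorizer and then the classifier; predicting pushes new text through the same vectorizer before classification.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only
toy_texts = [
    "app crashes on startup",
    "error when saving file",
    "add dark mode",
    "support export to pdf",
    "how do i change the theme",
    "where is the config file",
]
toy_labels = ["bug", "bug", "enhancement", "enhancement", "question", "question"]

toy_pipeline = Pipeline(steps=[
    ("cv", CountVectorizer()),   # stage 1: text -> token counts
    ("nb", MultinomialNB()),     # stage 2: classifier trained on the counts
])
toy_pipeline.fit(toy_texts, toy_labels)

# Raw text goes in; the pipeline vectorizes it before predicting
prediction = toy_pipeline.predict(["crashes when saving"])
print(prediction)

# Individual stages remain accessible by name
print(len(toy_pipeline.named_steps["cv"].vocabulary_), "vocabulary words")
```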

In [None]:
# Making pipeline
pipeLine_nb = Pipeline(steps=[('cv', CountVectorizer()), ('nb', MultinomialNB())])
pipeLine_lr = Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])
pipeLine_dt = Pipeline(steps=[('cv', CountVectorizer()), ('dt', DecisionTreeClassifier())])
pipeLine_rf = Pipeline(steps=[('cv', CountVectorizer()), ('rf', RandomForestClassifier())])

In [None]:
# Build actual model - Naive Bayes
pipeLine_nb.fit(x_train, y_train)
y_pred_nb = pipeLine_nb.predict(x_test)
print(classification_report(y_test,y_pred_nb))

In [None]:
# Build actual model - DecisionTreeClassifier
pipeLine_dt.fit(x_train, y_train)
y_pred_dt = pipeLine_dt.predict(x_test)
print(classification_report(y_test,y_pred_dt))

In [None]:
# Build actual model - Logistic Regression
pipeLine_lr.fit(x_train, y_train)
print(pipeLine_lr.score(x_test, y_test))  # accuracy on the test set
y_pred_lr = pipeLine_lr.predict(x_test)
print(classification_report(y_test,y_pred_lr))

In [None]:
# Build actual model - Random Forest
pipeLine_rf.fit(x_train, y_train)
y_pred_rf = pipeLine_rf.predict(x_test)
print(classification_report(y_test,y_pred_rf))
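
Beyond the per-class report, a confusion matrix shows which labels get mistaken for which. A minimal sketch on hand-written labels (made up for illustration, not model output); on the real data, `y_test` and one of the `y_pred_*` arrays would take their place.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

label_order = ["bug", "enhancement", "question"]

# Hand-written labels, for illustration only
toy_true = ["bug", "bug", "enhancement", "question", "question", "bug"]
toy_pred = ["bug", "enhancement", "enhancement", "question", "bug", "bug"]

# Rows are true labels, columns are predicted labels, both in label_order
cm = confusion_matrix(toy_true, toy_pred, labels=label_order)
acc = accuracy_score(toy_true, toy_pred)

print(cm)
print(f"accuracy: {acc:.2f}")
```

Off-diagonal entries are the misclassifications: here one bug was predicted as an enhancement and one question as a bug.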

In [None]:
# Make prediction
# Source: https://github.com/streamlit/streamlit/issues
# Bug - Keras load_img is not working if i display any image on the top of the page
# Enhancement - Feature request: Slider: negative space and histograms
# Question - Number Input Scientific Format
test1 = "Keras load_img is not working if i display any image on the top of the page"
test2 = "Feature request: Slider: negative space and histograms"
test3 = "Number Input Scientific Format"

In [None]:
# Making prediction - Naive Bayes - Bug
pipeLine_nb.predict([test1])

In [None]:
# Making prediction - Naive Bayes - Enhancement
pipeLine_nb.predict([test2])

In [None]:
# Making prediction - Naive Bayes - Question
pipeLine_nb.predict([test3])

In [None]:
# Making prediction - DecisionTreeClassifier - Bug
pipeLine_dt.predict([test1])

In [None]:
# Making prediction - DecisionTreeClassifier - Enhancement
pipeLine_dt.predict([test2])

In [None]:
# Making prediction - DecisionTreeClassifier - Question
pipeLine_dt.predict([test3])

In [None]:
# Making prediction - Logistic Regression - Bug
pipeLine_lr.predict([test1])

In [None]:
# Making prediction - Logistic Regression - Enhancement
pipeLine_lr.predict([test2])

In [None]:
# Making prediction - Logistic Regression - Question
pipeLine_lr.predict([test3])