Import the required packages and set up the Spark session. This step is not required when running on Databricks.

In [1]:
import findspark
findspark.init()
findspark.find()

'/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/pyspark'

In [76]:
from pyspark.sql import SparkSession
import pyspark
spark = SparkSession.builder.appName('A2').getOrCreate()

Note: Spark is used only for GridSearchCV, which appears to be the only sklearn method currently supported by Spark. The remaining code runs on pandas DataFrames.

Load the CSV file (n=1000 samples), which contains our manual labels as the target vector.

In [13]:
df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiline", "true")
  .option("quote", '"')
  .option("escape", '"')  # a second "escape" option overrides the first, so only this one is kept
  .load("GH-React.csv")
)

In [1]:
import pandas as pd

df = pd.read_csv("./GH-React.csv")

In [7]:
print("Shape:", df.shape)
df.columns

Shape: (1000, 16)


Index(['html_url', 'number', 'title', 'labels', 'state', 'locked', 'milestone',
       'comments', 'created_at', 'updated_at', 'closed_at',
       'author_association', 'state_reason', 'assignee.login', 'body',
       'Target'],
      dtype='object')

## Machine Learning Pipeline
### Stage 1
The preprocess() function defined below takes a Markdown-formatted string from GitHub and returns a new string ready for the next stages of our ML pipeline.

In [13]:
import re

def preprocess(text):
    stripped = text.lower()

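    # remove all headings, bold text, and HTML comments from the Markdown text.
    # These items have all been used by the React team in their issue templates on GitHub
    # a heading is 1-6 '#' characters at the start of the text or after whitespace, through the end of that line
    headings_pattern = r'(?:(?<=\s)|^)#{1,6}(.*?)$'
    bold_pattern = r'\*\*(.+?)\*\*(?!\*)'
    comments_pattern = r'<!--((.|\n)*?)-->'
    combined_pattern = r'|'.join((headings_pattern, bold_pattern, comments_pattern))

    # re.MULTILINE makes ^ and $ match at every line boundary, not only at the ends of the string
    stripped = re.sub(combined_pattern, '', stripped, flags=re.MULTILINE)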

    # find all URLs in the string, and then remove the final directory from each to leave the general URL form
    # there may be useful patterns based on what URLs issues are commonly linking to
    url_pattern = re.compile(r'(https?://[^\s]+)')
    for url in re.findall(url_pattern, stripped):
        new_url = url.rsplit("/", 1)[0]
        stripped = stripped.replace(url, new_url)
    
    # drop everything except letters, spaces, and newlines (digits and punctuation are removed)
    non_alpha_pattern = r'[^A-Za-z\n ]+'
    stripped = re.sub(non_alpha_pattern, '', stripped)
    
    return stripped

In [89]:
test = df['body'][4]
preprocess(test)

'\n\n\n\nbug or undefined behaviour\n\n\n\ndoing\n\nreactchildrentoarray\n  reactdomcreateportal\n\n\n\nfails with\n\nobjects are not valid as a react child found object with keys typeof key children containerinfo implementation if you meant to render a collection of children use an array instead\n\n\nnamely the following complete snippet fails\n\njsx\nimport react from react\nimport  render createportal  from reactdom\n\nconst renderchildren   children   \n  children  reactchildrentoarraychildren\n  return hrenders children with toarray childrenh\n\n\n\nconst app     \n  renderchildren namecodesandbox\n    createportaldivrendered in portaldiv documentgetelementbyidportal\n  renderchildren\n\n\nrenderapp  documentgetelementbyidroot\n\n\nwhile the following one which wraps the portal in another element works just fine\n\njsx\nimport react from react\nimport  render createportal  from reactdom\n\nconst renderchildren   children   \n  children  reactchildrentoarraychildren\n  return hrend
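Two of the steps in preprocess() can be checked in isolation. The sketch below uses a made-up Markdown string and a hypothetical issue URL (neither comes from the dataset) to show how a line-anchored heading pattern behaves with and without re.MULTILINE, and what trimming the last path segment of a URL produces.

```python
import re

# Hypothetical Markdown sample with two heading lines
sample = "# Title\nbody text\n## Steps\nmore text"
heading = r'^#{1,6}.*$'

# Without re.MULTILINE, ^ and $ anchor to the whole string, so nothing matches here
no_flag = re.sub(heading, '', sample)
# With re.MULTILINE, each heading line is matched and removed
with_flag = re.sub(heading, '', sample, flags=re.MULTILINE)

# URL trimming as done in preprocess(): drop the last path segment
url = "https://github.com/facebook/react/issues/123"  # hypothetical issue number
trimmed = url.rsplit("/", 1)[0]

print(repr(no_flag))
print(repr(with_flag))
print(trimmed)
```

Trimming the final segment collapses links to individual issues into one shared token, which is what lets the later TF-IDF stage pick up on commonly linked locations.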

### Stage 2
Split the data into training (80%) and validation (20%) sets. We stratify on the label because our dataset is imbalanced.

In [98]:
y = df['Target']
data = df.drop(['Target'], axis=1)

In [116]:
from sklearn.model_selection import train_test_split

data_train, data_val, y_train, y_val = train_test_split(data, y, train_size=0.8, stratify=y, random_state=1)

print(data_train.shape)
print(data_val.shape)

(800, 15)
(200, 15)
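To illustrate what stratify=y does, here is a minimal sketch on synthetic labels (the 90/10 class ratio and the names y_toy and X_toy are assumptions for illustration, not properties of the real dataset). Each split keeps the same class proportions as the full label vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=1
)

# Fraction of class 1 in each split; both should match the overall 10%
print(y_tr.mean(), y_va.mean())
```

Without stratification, a rare class could end up under-represented in the validation set by chance, which would make the validation score less reliable.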


### Stage 3
Create a TF-IDF features matrix using TfidfVectorizer from sklearn applied to the title and body of each issue.

We will also add the 'author_association' feature from the GitHub issue, since Members, Collaborators, and Contributors may submit more valid bugs and feature requests than "None" users.

While lemmatization could have been done earlier in the pre-processing stage, it is more efficient to lemmatize at this point in a custom_tokenizer() function passed to TfidfVectorizer, since tokenization is part of both processes.

First, define the tokenizer and vectorizer:

In [108]:
%pip install spacy
!python -m spacy download en_core_web_sm

Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [109]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")

# create a custom tokenizer using the spacy document processing pipeline
def custom_tokenizer(document):
    ppd = preprocess(document)
    doc = nlp(ppd)
    return [token.lemma_ for token in doc]

vect = TfidfVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 2))

Create the features matrix:

In [117]:
# combine the title and body into one column
text = (data_train['title'] + ' ' + data_train['body']).values.astype('U')

In [118]:
# create text_features using our vectorizer
text_features = vect.fit_transform(text)

In [119]:
from sklearn.preprocessing import OneHotEncoder

# use one hot encoder to transform the author_association to a feature set
ohe = OneHotEncoder()
author_association = ohe.fit_transform(data_train['author_association'].to_numpy().reshape(-1, 1))

In [120]:
from scipy.sparse import hstack

# create our final features matrix by combining both sets
X_train = hstack((text_features, author_association))

In [121]:
print("Shape of feature matrix:", X_train.shape)
print("Shape of target vector:", y_train.shape)

Shape of feature matrix: (800, 63479)
Shape of target vector: (800,)
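When the validation split is encoded later, the vectorizer and encoder fitted on the training data would normally be reused with transform() rather than fit_transform(), so that both matrices share the same columns. The sketch below shows this on toy rows; the DataFrames, the names toy_vect and toy_ohe, and the handle_unknown="ignore" setting are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for data_train / data_val
train = pd.DataFrame({
    "text": ["button crashes on click", "add dark mode", "crash when rendering portal"],
    "author_association": ["NONE", "MEMBER", "NONE"],
})
val = pd.DataFrame({
    "text": ["portal crashes app", "please add hooks docs"],
    "author_association": ["CONTRIBUTOR", "NONE"],  # CONTRIBUTOR is unseen in train
})

toy_vect = TfidfVectorizer()
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
toy_ohe = OneHotEncoder(handle_unknown="ignore")

# Fit on the training rows only, then reuse the fitted objects on validation rows
X_tr = hstack((toy_vect.fit_transform(train["text"]),
               toy_ohe.fit_transform(train[["author_association"]])))
X_va = hstack((toy_vect.transform(val["text"]),
               toy_ohe.transform(val[["author_association"]])))

print(X_tr.shape, X_va.shape)
```

Fitting only on the training rows keeps vocabulary and category information from the validation set out of the model.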


### Stage 4
Use cross-validation on our training set to evaluate our model parameters.

#### Grid Search for optimizing model

Use grid search to find potentially better model parameters:

In [83]:
%pip install joblibspark

Collecting joblibspark
  Downloading joblibspark-0.5.0-py3-none-any.whl (15 kB)
Installing collected packages: joblibspark
Successfully installed joblibspark-0.5.0
Note: you may need to restart the kernel to use updated packages.


In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np

# Use sklearn's GridSearchCV, with joblibspark providing a Spark backend for joblib
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from joblib import parallel_backend

register_spark()  # registers the "spark" backend; it is only used inside a parallel_backend("spark") block

pipe = make_pipeline(TfidfVectorizer(tokenizer=custom_tokenizer), LogisticRegression())

param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
                "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(data_train, y_train)

ValueError: 
All the 90 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/sklearn/pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1138, in fit
    X, y = self._validate_data(
  File "/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/sklearn/base.py", line 596, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1092, in check_X_y
    check_consistent_length(X, y)
  File "/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/sklearn/utils/validation.py", line 387, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [15, 640]
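The [15, 640] in the traceback is consistent with the vectorizer receiving the whole DataFrame. Iterating over a pandas DataFrame yields its column labels, so TfidfVectorizer sees 15 "documents" (one per column of data_train) while the target has 640 rows (four of the five folds of 800). A minimal sketch of the likely fix, on hypothetical toy data, is to pass a 1-D array of text instead. In the notebook, the `text` array built in Stage 3 would play the role of `toy_text`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy issues; titles, bodies, and labels are made up for illustration
toy = pd.DataFrame({
    "title": ["App crashes on render", "Add dark mode", "Error in effect cleanup",
              "Support custom hooks docs", "Crash when unmounting portal",
              "Request new lifecycle method", "Error thrown during hydration",
              "Add option for strict mode"],
    "body": ["stack trace attached", "would be nice to have", "throws on unmount",
             "docs are missing examples", "portal breaks the tree",
             "useful for animations", "server markup mismatch", "opt in per subtree"],
})
labels = ["bug", "feature", "bug", "feature", "bug", "feature", "bug", "feature"]

# Iterating a DataFrame yields column labels, not rows
columns_seen = list(toy)

# Combine title and body into a 1-D array of strings, one document per issue
toy_text = (toy["title"] + " " + toy["body"]).to_numpy()

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression__C": [0.1, 1, 10],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)]}

grid = GridSearchCV(pipe, param_grid, cv=2)
# To run the fits on Spark, this call could be wrapped (not run here) as:
#   with parallel_backend("spark", n_jobs=3): grid.fit(toy_text, labels)
grid.fit(toy_text, labels)

print(columns_seen)
print(grid.best_params_)
```

Because the pipeline contains its own TfidfVectorizer, it expects raw text as input. The author_association feature would need a ColumnTransformer or similar if it is to be included in the grid search.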


In [None]:
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

### Stage 5
Validate our chosen model against the validation set we set aside in Stage 2.