Import the required packages and set up the Spark session. This would not be required if using DataBricks.

In [None]:
import findspark
findspark.init()
findspark.find()

In [None]:
from pyspark.sql import SparkSession
import pyspark;
spark = SparkSession.builder.appName('A2').getOrCreate();

Note: We will be using Spark for GridSearchCV only. This appears to be the only sklearn method currently supported by Spark. The remaining code will be executed using Pandas dataframes.

Load the CSV file (n=1000 samples) containing our manual labels as the target vector

In [None]:
df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiline", "true")
  .option("quote", '"')  
  .option("escape", "\\")
  .option("escape", '"')
  .load("GH-React.csv")
)

In [6]:
import pandas as pd

df = pd.read_csv("./GH-React.csv")

In [None]:
print("Shape:", df.shape)
df.columns

## Machine Learning Pipeline
### Stage 1
The preprocess() function is defined below. It takes in a String formatted as Markdown from GitHub and pre-processes it to return a new string ready for the next stages in our ML Pipeline.

In [52]:
import re

def preprocess(text):
    stripped = text.lower()

    # remove all headings, bold text, and HTML comments from the Markdown text.
    # These items have all been used by the React team in their issue templates on GitHub
    headings_pattern = r'(<=\s|^)#{1,6}(.*?)$'
    bold_pattern = r'\*\*(.+?)\*\*(?!\*)'
    comments_pattern = r'<!--((.|\n)*?)-->'
    combined_pattern = r'|'.join((headings_pattern, bold_pattern, comments_pattern))

    stripped = re.sub(combined_pattern, '', stripped)

    # find all URLs in the string, and then remove the final directory from each to leave the general URL form
    # there may be useful patterns based on what URLs issues are commonly linking to
    url_pattern = re.compile(r'(https?://[^\s]+)')
    for url in re.findall(url_pattern, stripped):
        new_url = url.rsplit("/", 1)[0]
        stripped = stripped.replace(url, new_url)

    non_alpha_pattern = r'[^A-Za-z ]+'
    stripped = re.sub(non_alpha_pattern, '', stripped)    
    
    return ' '.join(stripped.split())

In [53]:
test = df['body'][4]
preprocess(test)

'bug or undefined behaviourdoingreactchildrentoarray reactdomcreateportalfails withobjects are not valid as a react child found object with keys typeof key children containerinfo implementation if you meant to render a collection of children use an array insteadnamely the following complete snippet failsjsximport react from reactimport render createportal from reactdomconst renderchildren children children reactchildrentoarraychildren return hrenders children with toarray childrenhconst app renderchildren namecodesandbox createportaldivrendered in portaldiv documentgetelementbyidportal renderchildrenrenderapp documentgetelementbyidrootwhile the following one which wraps the portal in another element works just finejsximport react from reactimport render createportal from reactdomconst renderchildren children children reactchildrentoarraychildren return hrenders children with toarray childrenhconst app renderchildren namecodesandbox div createportaldivrendered in portaldiv documentgetel

### Stage 2
Split the data into training (80%) and validation(20%) sets. We will stratify based on the label since our dataset is imbalanced.

In [9]:
y = df['Target']
data = df.drop(['Target'], axis=1)

In [12]:
from sklearn.model_selection import train_test_split

data_train, data_val, y_train, y_val = train_test_split(data, y, train_size=0.8, stratify=y, random_state=1)

print(data_train.shape)
print(data_val.shape)

(800, 15)
(200, 15)


### Stage 3
Create a TF-IDF features matrix using TfidfVectorizer from sklearn applied to the title and body of each issue.

We will additionally add in the feature 'author_association' from the GitHub issue, as there may be a correlation between Members/Collaborators/Contributors submitting more valid bugs/feature requests than "None" users.

While lemmatization could have been done earlier in the pre-processsing stage, it is more efficient to lemmatize at this point in a custom_tokenizer() function passed to TfidfVectorizer since tokenization is part of both processses.

First, define the tokenizer and vectorizer:

In [None]:
%pip install spacy
!python -m spacy download en_core_web_sm

In [42]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

# create a custom tokenizer using the spacy document processing pipeline
def custom_tokenizer(document):
    ppd = preprocess(document)
    doc = nlp(ppd)
    return [token.lemma_ for token in doc]

vect = TfidfVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 2))

Create the features matrix:

### TODO: Need to convert the below effort into a columnular transformer. It cannot be a manual effort if we want to use this in grid search later on.

In [14]:
# combine the title and body into one column
text = (data_train['title'] + ' ' + data_train['body']).values.astype('U')

In [54]:
# create text_features using our vectorizer
text_features = vect.fit_transform(text)

In [57]:
import numpy as np
sorted_by_idf = np.argsort(vect.idf_)
feature_names = np.array(vect.get_feature_names())

print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:20]]))
print("Features with highest idf:\n{}".format(feature_names[sorted_by_idf[-100:]]))

Features with lowest idf:
['the' 'be' 'to' 'not' 'in' 'a' 'and' 'react' 'this' 'I' 'it' 'of' 'do'
 'with' 'use' 'that' 'for' 'version' 'bug' 'when']
Features with highest idf:
['getcomponentname to' 'getcomponentnamefromfiber'
 'getcomponentnamefromfiber although' 'getcomponentnamefromtype'
 'getcomponentnamefromtype getcomponentnamefromfiber'
 'getcomponentnamefromtype in' 'getcomponentthisgetcomponent'
 'getcomponentthisgetcomponent export' 'getcssmodulelocalident'
 'getcssmodulelocalident lessloader' 'getcurrenttimeshouldyieldtohost'
 'getcurrenttimeshouldyieldtohost return' 'getcurrentvalue'
 'getcurrentvalue chromeextensionfmkadmapgofadopljbjfkapdkoienihibuildmainj'
 'getcursor consolelogccstatestate' 'getcursor const'
 'getchildhostcontextcontext fibertype' 'getdata const'
 'getchildhostcontextcontext' 'getchildhostcontext hi' 'get theme'
 'get throw' 'get to' 'get typeerror' 'get use' 'get value' 'get with'
 'get zero' 'getaudiocontext' 'getaudiocontext setaudiotrack'
 'getaudio

In [None]:
from sklearn.preprocessing import OneHotEncoder

# use one hot encoder to transform the author_association to a feature set
ohe = OneHotEncoder()
author_association = ohe.fit_transform(data_train['author_association'].to_numpy().reshape(-1, 1))

In [None]:
from scipy.sparse import hstack

# create our final features matrix by combining both sets
X_train = hstack((text_features, author_association))

In [None]:
print("Shape of feature matrix:", X_train.shape)
print("Shape of target vector:", y_train.shape)

### TODO: See if we can visualize our text features to make sure it seems logical. Take an example from Chapter 7 in ML book.

### Stage 4
Use cross-validate with our training set to test our model parameters

#### Grid Search for optimizing model

Use grid search to find potentially better model parameters:

In [None]:
%pip install joblibspark

#TODO: Need to make the features vector matrix creation into a pipeline or function so we can feed it into grid search. It cannot be a manual effort.

could use the columnular transformer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np

# Use spark_sklearn’s grid search instead:
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from sklearn.utils import parallel_backend

register_spark() # register spark backend

pipe = make_pipeline(TfidfVectorizer(tokenizer=custom_tokenizer), LogisticRegression())

param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
                "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(data_train, y_train)

In [None]:
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

### Stage 5
Validate our chosen model against the validation set we saved in Stage 3