Import the required packages and set up the Spark session. This step is not needed on Databricks, which provides a ready-made `spark` session.

In [1]:
import findspark
findspark.init()
findspark.find()

'/home/graeme/anaconda3/envs/ENSF-612/lib/python3.9/site-packages/pyspark'

In [76]:
from pyspark.sql import SparkSession
import pyspark
spark = SparkSession.builder.appName('A2').getOrCreate()

Load the CSV file (n=1000 samples) containing our manual labels as the target vector

In [13]:
df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiline", "true")
  .option("quote", '"')
  .option("escape", '"')  # quotes inside quoted fields are doubled (""); a second "escape" option would override the first
  .load("GH-React.csv")
)
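
The `multiline`, `quote` and `escape` options matter because GitHub issue bodies contain newlines and double quotes. As a rough illustration of the same quoting convention outside Spark, the sketch below parses a small made-up CSV string with Python's built-in `csv` module, where a quote inside a quoted field is written as two quotes. The sample data is hypothetical and is not taken from GH-React.csv.

```python
import csv
import io

# Hypothetical two-column sample: the body spans two lines and contains
# a double quote that is escaped by doubling it ("").
raw = 'title,body\n"Crash on mount","Steps:\nclick ""Save"" twice"\n'

# csv.reader keeps the embedded newline and collapses "" into a single quote.
rows = list(csv.reader(io.StringIO(raw)))
header, record = rows[0], rows[1]
print(header)
print(record)
```

If the embedded newline were not handled, the single record would be split into two broken rows, which is the failure the `multiline` option guards against when reading with Spark.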

In [80]:
def sparkShape(df):
    return (df.count(), len(df.columns))

In [81]:
print("Shape:",sparkShape(df))
df

Shape: (1000, 16)


DataFrame[html_url: string, number: int, title: string, labels: string, state: string, locked: boolean, milestone: string, comments: int, created_at: timestamp, updated_at: timestamp, closed_at: timestamp, author_association: string, state_reason: string, assignee.login: string, body: string, Target: string]

## Machine Learning Pipeline
### Stage 1
The preprocess() function is defined below. It takes a Markdown-formatted string from a GitHub issue and returns a cleaned string ready for the next stages of our ML pipeline.

In [75]:
import re

def preprocess(text):
    stripped = text.lower()

    # remove all headings, bold text, and HTML comments from the Markdown text.
    # These items have all been used by the React team in their issue templates on GitHub
    # (?m:...) lets ^ and $ anchor at every line, so headings anywhere in the text are matched
    headings_pattern = r'(?m:^[ \t]*#{1,6}.*?$)'
    bold_pattern = r'\*\*(.+?)\*\*(?!\*)'
    comments_pattern = r'<!--((.|\n)*?)-->'
    combined_pattern = r'|'.join((headings_pattern, bold_pattern, comments_pattern))

    stripped = re.sub(combined_pattern, '', stripped)

    # find all URLs in the string, and then remove the final directory from each to leave the general URL form
    # there may be useful patterns based on what URLs issues are commonly linking to
    url_pattern = re.compile(r'(https?://[^\s]+)')
    for url in re.findall(url_pattern, stripped):
        new_url = url.rsplit("/", 1)[0]
        stripped = stripped.replace(url, new_url)
    
    non_alpha_pattern = r'[^A-Za-z\n ]+'
    stripped = re.sub(non_alpha_pattern, '', stripped)
    
    return stripped
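
As a minimal sketch of the template-stripping step in isolation, the snippet below applies heading, bold and HTML-comment patterns to a short made-up Markdown string. It assumes `re.MULTILINE` so that `^` and `$` anchor at every line; the helper name `strip_template_markup` and the sample text are hypothetical.

```python
import re

# Patterns for ATX-style headings, **bold** runs and <!-- HTML comments -->.
HEADING = r'^[ \t]*#{1,6}[^\n]*$'
BOLD = r'\*\*(.+?)\*\*(?!\*)'
COMMENT = r'<!--(?:.|\n)*?-->'
TEMPLATE_MARKUP = re.compile('|'.join((HEADING, BOLD, COMMENT)), flags=re.MULTILINE)

def strip_template_markup(markdown):
    """Remove issue-template headings, bold prompts and HTML comments."""
    return TEMPLATE_MARKUP.sub('', markdown)

sample = "<!-- note -->\n## Steps\n**Expected?**\nIt works\n"
cleaned = strip_template_markup(sample)
print(repr(cleaned))
```

Only the free-text answer survives; the newlines that separated the removed items are left in place.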

In [31]:
test = df.select('body').take(2)
test = test[1].body
test


'<!--\n  Note: if the issue is about documentation or the website, please file it at:\n  https://github.com/reactjs/reactjs.org/issues/new\n-->\n\n**Do you want to request a *feature* or report a *bug*?**\nBug\n\n**What is the current behavior?**\nA specific order of unmounting and remounting `unstable_createReturn`s from `react-call-return` causes an invariant violation in `unmountHostComponents`.\n\n**Reproduce**\nThe following sandbox example crashes with an invariant violation when both the `min` and `cycle` props are *odd* numbers greater than zero.\n\nhttps://codesandbox.io/s/llyjz19rz7\n\n**What is the expected behavior?**\nThe app does not crash and cycles the number of items in the list.\n\n**Which versions of React, and which browser / OS are affected by this issue? Did this work in previous versions of React?**\n`react` and `react-dom` versions 16.1 and newer, `react-call-return` version 0.5.0\n'

In [74]:
preprocess(test)




bug


a specific order of unmounting and remounting unstablecreatereturns from reactcallreturn causes an invariant violation in unmounthostcomponents


the following sandbox example crashes with an invariant violation when both the min and cycle props are odd numbers greater than zero

httpscodesandboxios


the app does not crash and cycles the number of items in the list


react and reactdom versions  and newer reactcallreturn version 



### Stage 2
Create a TF-IDF features matrix using TfidfVectorizer from sklearn applied to the title and body of each issue.
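
To make the idea of a TF-IDF weight concrete, the sketch below computes it by hand for three made-up token lists. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is what scikit-learn documents as the TfidfVectorizer defaults. The function name `tfidf` and the documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [["app", "crash", "on", "mount"],
        ["app", "crash", "crash"],
        ["feature", "request"]]
rows = tfidf(docs)
print(rows[0])
```

In the first document, "mount" receives a higher weight than "app" because it appears in fewer documents overall.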

We will additionally add in the feature 'author_association' from the GitHub issue, as there may be a correlation between Members/Collaborators/Contributors submitting more valid bugs/feature requests than "None" users.
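
A categorical column such as 'author_association' has to be turned into numbers before it can sit next to the TF-IDF columns. One common option is one-hot encoding, sketched below in plain Python; the helper name `one_hot` and the sample values are illustrative assumptions. In a scikit-learn workflow, `OneHotEncoder` combined with `scipy.sparse.hstack` could play the same role.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator row."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

# Illustrative values only; the real column comes from the GitHub issue data.
categories, encoded = one_hot(["MEMBER", "NONE", "CONTRIBUTOR", "NONE"])
print(categories)
print(encoded)
```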

While lemmatization could have been done earlier in the pre-processing stage, it is more efficient to lemmatize at this point in a custom_tokenizer() function passed to TfidfVectorizer, since tokenization is part of both processes.

First, define the tokenizer and vectorizer:

In [None]:
# Do not run yet. I'm not sure if we want to use this regex tokenizer instead of the built-in lemma tokenizer

# Technicality: we want to use the regexp-based tokenizer
# that is used by CountVectorizer and only use the lemmatization
# from spacy. To this end, we replace en_nlp.tokenizer (the spacy tokenizer)
# with the regexp-based tokenization.
import re
import spacy
from spacy.tokens import Doc

# regexp used in CountVectorizer
regexp = re.compile(r'(?u)\b\w\w+\b')
# load spacy language model (assumes spaCy 3 with the en_core_web_sm model installed)
en_nlp = spacy.load('en_core_web_sm')
# replace the tokenizer with the preceding regexp; a custom tokenizer must return a Doc
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

In [None]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.pipeline import make_pipeline

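# assumes spaCy 3 with the en_core_web_sm model installed; the parser and NER
# components are disabled because only lemmas are needed
en_nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# create a custom tokenizer using the spacy document processing pipeline
def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]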

vect = TfidfVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 2))

Create the features matrix:

In [None]:
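# `text` must be a local iterable of strings (the preprocessed title and body of each issue), not a Spark DataFrame
X = vect.fit_transform(text)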
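
The `text` input is not defined in the cells above. One possible way to build it, sketched here under assumptions, is to collect the 'title' and 'body' columns to the driver and join them per issue. The helper name `build_text` is hypothetical, the rows are made-up dicts standing in for what `df.select('title', 'body').collect()` would return (Spark `Row` objects support the same key lookup), and the default `preprocess` argument is a stand-in for the Stage 1 function.

```python
def build_text(rows, preprocess=lambda s: s.lower()):
    """Join title and body for each row, treating missing values as empty."""
    texts = []
    for row in rows:
        title = row["title"] or ""
        body = row["body"] or ""
        texts.append(preprocess(f"{title}\n{body}"))
    return texts

# Made-up rows; the second one has no body, which happens for empty issues.
rows = [
    {"title": "Crash on mount", "body": "It **breaks**"},
    {"title": "Docs typo", "body": None},
]
text = build_text(rows)
print(text)
```

Guarding against `None` matters because calling a string method on a missing body would otherwise raise an error.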

Create the target vector:

In [83]:
y = df.select('Target')

In [82]:
print("Shape of feature matrix:", X.shape)  # X is a SciPy sparse matrix, not a Spark DataFrame
print("Shape of target vector:", sparkShape(y))

Shape of target vector: (1000, 1)


### Stage 3
Split the data into training (80%) and validation (20%) sets. We stratify on the label because our dataset is imbalanced.

In [None]:
from sklearn.model_selection import train_test_split

# test_size=0.2 gives the 80/20 split described above (the default is 0.25)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
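
To show what stratifying does, the sketch below splits each class separately so that both sets keep the original class proportions. It is a simplified stand-in for the `stratify` argument, not scikit-learn's implementation; the function name `stratified_split` and the "bug"/"feature" labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_val = round(len(indices) * test_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)

# Imbalanced toy labels: 80% of one class, 20% of the other.
labels = ["bug"] * 80 + ["feature"] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With a plain random split, the minority class could by chance end up under-represented in the validation set; splitting per class avoids that.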

### Stage 4
Use cross-validation on our training set to evaluate our model parameters.

#### Grid Search for optimizing model

Use grid search to find potentially better model parameters:

In [None]:
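from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(tokenizer=custom_tokenizer), LogisticRegression())

# recent scikit-learn versions reject an integer min_df of 0; 1 applies no cutoff
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
              "tfidfvectorizer__min_df": [1, 3, 5]}

grid = GridSearchCV(pipe, param_grid, cv=5)
# text_train must hold the raw (preprocessed) text of the training split,
# because the pipeline does its own vectorizing
grid.fit(text_train, y_train)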

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
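
It helps to know how much work this search implies before running it. The sketch below enumerates the candidate combinations with `itertools.product` and multiplies by the number of folds. The grid mirrors the one above, with the smallest `min_df` written as 1, which applies no document-frequency cutoff.

```python
from itertools import product

param_grid = {
    "logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidfvectorizer__min_df": [1, 3, 5],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

That is 54 candidates and 270 cross-validation fits, plus one final refit on the whole training set with GridSearchCV's default `refit=True`. Each fit re-runs the spaCy tokenizer, so the search may be slow.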

### Stage 5
Validate our chosen model against the validation set we saved in Stage 3
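
Because the dataset is imbalanced, accuracy alone can hide poor performance on the minority class, so per-class precision and recall are worth checking at this stage. The sketch below computes them in plain Python; the function name `per_class_report`, the labels and the predictions are all made up for illustration and are not results from our model.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # missed a true t
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        report[label] = (precision, recall)
    return report

# Hypothetical labels and predictions.
y_true = ["bug", "bug", "bug", "feature", "feature"]
y_pred = ["bug", "bug", "feature", "feature", "bug"]
report = per_class_report(y_true, y_pred)
print(report)
```

With the fitted search object, `grid.score(...)` on the held-out text and labels would give the overall accuracy, and scikit-learn's `classification_report` produces the same kind of per-class breakdown.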