This notebook trains a Word2Vec model on the issues opened on GitHub. The resulting pre-trained W2V model can then be fed into a fastText model to classify GitHub issues. The notebook covers gathering the data, processing it in place, and training the model on the loaded data. Because the data is streamed into the model one day at a time, no data has to be saved.

In [196]:
from google.cloud import bigquery
from google.oauth2 import service_account
import json
import re
import datetime
import pandas as pd
import numpy as np
import urllib
import zipfile
import os
import langid
from nltk.tokenize import word_tokenize
import nltk
from string import punctuation
from collections import Counter
from tqdm.notebook import tqdm
from gensim.models import Word2Vec, KeyedVectors

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/atersaak/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

First, a Google Cloud project has to be created in order to use BigQuery to access the [GHArchive](https://www.gharchive.org/). The service account credentials can be stored in the root folder, and the project ID should match the one below.

In [22]:
# save key .json file in the github labeler root
# project id on bigquery account should match

credentials = service_account.Credentials.from_service_account_file(
'../../github-issue-data-extraction-key.json')

project_id = 'github-issue-data-extraction'
client = bigquery.Client(credentials=credentials, project=project_id)

We define some simple functions to get the data in the format we wish, removing the quotes around text and deleting issues made by bots.

In [24]:
# simple preprocessing functions

def remove_quotes(string):
    """
    Remove quotes from the string (everything extracted from json has quotes)
    """
    if isinstance(string, str):
        return string[1:-1]
    return string

def is_bot(actor):
    """
    Identify users clearly tagged as bots (missing actors are treated as bots)
    """
    if not isinstance(actor, str):
        return True
    return actor.endswith('[bot]')
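
The quotes are there because `JSON_EXTRACT` hands each value back as JSON text. As a small illustration (the sample string is made up), the sketch below compares slicing off the quotes, which is what `remove_quotes` does, with decoding the value via `json.loads`. Slicing leaves escape sequences such as `\r\n` as literal backslash pairs, which would explain why the preprocessing further down looks for escaped newlines rather than real ones.

```python
import json

# Hypothetical issue body, encoded the way a JSON string value is stored.
raw = json.dumps("Crash on start\r\nSee log")

# What remove_quotes does: drop the first and last character.
stripped = raw[1:-1]

# Alternative: decode the JSON string, which also resolves escape sequences.
decoded = json.loads(raw)

print(repr(stripped))  # backslash-r backslash-n are still literal characters
print(repr(decoded))   # a real carriage return and newline
```

Either approach works with the rest of the notebook as long as the newline handling matches the one chosen here.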

We make a function that takes a given day and returns a dataframe of the GitHub issues opened on that day, with the light processing defined above applied.

In [27]:
def get_data_for_day(day):
    """
    Pass in a datetime object and a dataframe of all the issue data from that day will be returned
    """
    date = day.strftime('%Y%m%d')
    response = client.query(f"""SELECT JSON_EXTRACT(payload, '$.issue.title') as title,
                                JSON_EXTRACT(payload, '$.issue.body') as body,
                                JSON_EXTRACT(payload, '$.issue.html_url') as url,
                                JSON_EXTRACT(payload, '$.issue.user.login') as actor
                                FROM githubarchive.day.{date}
                                WHERE type = 'IssuesEvent' AND JSON_EXTRACT(payload, '$.action') = '"opened"' 
                                """)
    df = response.to_dataframe()
    return df

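def process_df(df):
    """
    Strip the JSON quotes from every column, then drop issues opened by bots
    """
    df = df.copy()
    for col in df.columns:
        df[col] = df[col].apply(remove_quotes)
    # only the actor column identifies bots; rows with a missing actor are dropped too
    df = df[~df['actor'].apply(is_bot)]
    return df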

Now we define a slew of preprocessing functions that simplify the text data and make it easier for the Word2Vec model to learn from. We also use langid to check whether an issue is written in English.

In [32]:
def is_english(text):
    """
    Determine whether a text is written in English
    """
    return langid.classify(text)[0] == 'en'

In [None]:
### preprocess functions defined below

function_list = []

pattern = r"```.+?```"
code_block_regex = re.compile(pattern, re.DOTALL)


def code_block(string):
    """Replace code blocks with a CODE_BLOCK."""
    string = re.sub(code_block_regex, "CODE_BLOCK", string)
    return string


function_list.append(code_block)

pattern = r"`{1,2}.+?`{1,2}"
inline_code_regex = re.compile(pattern, re.DOTALL)


def code_variable(string):
    """Replace inline code with INLINE."""
    string = re.sub(inline_code_regex, " INLINE ", string)
    return string


function_list.append(code_variable)

pattern = r"\s@[^\s]+"
tagged_user_regex = re.compile(pattern)


def tagged_user(string):
    """Replace a user tagged with USER."""
    string = re.sub(tagged_user_regex, " USER ", string)
    return string


function_list.append(tagged_user)

pattern = r"[^\s]+\.(com|org|net|gov|edu|io|ai)[^\s]*"
url_regex = re.compile(pattern)


def urls(string):
    """Replace URLs with URL."""
    string = re.sub(url_regex, " URL ", string)
    return string


function_list.append(urls)

pattern = r"((\\r)*\\n)+"
enter_regex = re.compile(pattern, re.DOTALL)


def enters(string):
    """Replace escaped newline sequences (literal \\r\\n or \\n) with a space."""
    string = re.sub(enter_regex, " ", string)
    return string

function_list.append(enters)

pattern = r"#{3,}"
bold_regex = re.compile(pattern, re.DOTALL)


def bold(string):
    """Replace runs of three or more '#' characters (Markdown heading markers) with a space."""
    string = re.sub(bold_regex, " ", string)
    return string

function_list.append(bold)

def remove_slashes(string):
    return string.replace('\\', '')

function_list.append(remove_slashes)

def preprocess(string):
    """Put all preprocessing functions together."""
    for func in function_list:
        string = func(string)
    return string
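
To make the effect of the pipeline concrete, here is a minimal sketch that applies a subset of the substitutions above, in the same order, to a made-up issue body. The name `preprocess_sketch` and the sample text are illustrative only. The code-block pattern is written as `` `{3} `` instead of three literal backticks, which matches the same text. Order matters: code blocks have to be replaced before inline code, otherwise the inline pattern would consume parts of the fences.

```python
import re

# (pattern, replacement) pairs, in the same order as function_list above.
steps = [
    (re.compile(r"`{3}.+?`{3}", re.DOTALL), "CODE_BLOCK"),
    (re.compile(r"`{1,2}.+?`{1,2}", re.DOTALL), " INLINE "),
    (re.compile(r"\s@[^\s]+"), " USER "),
    (re.compile(r"[^\s]+\.(com|org|net|gov|edu|io|ai)[^\s]*"), " URL "),
    (re.compile(r"((\\r)*\\n)+", re.DOTALL), " "),
]


def preprocess_sketch(text):
    """Apply each substitution in turn."""
    for regex, replacement in steps:
        text = regex.sub(replacement, text)
    return text


fence = "`" * 3
# Made-up issue body with escaped newlines, as they appear after remove_quotes.
sample = (
    "Build fails\\r\\n" + fence + "make all" + fence
    + "\\r\\ncc @alice see https://example.com/log and `foo()`"
)

out = preprocess_sketch(sample)
print(out.split())
```

The placeholders (`CODE_BLOCK`, `INLINE`, `USER`, `URL`) end up as ordinary tokens, so the model sees one shared token for each kind of noisy content.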

In [None]:
# function that will remove all punctuation that is not ending a sentence or a comma

punc = set(punctuation)

def is_punc(string):
    if string in ['.', '?', '!', ',']:
        return False
    for ch in string:
        if ch not in punc:
            return False
    return True
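
As a quick check of what this filter keeps, the sketch below restates `is_punc` in an equivalent compact form (so the block runs on its own) and applies it to a hand-written token list. The tokens are made up for illustration.

```python
from string import punctuation

punc = set(punctuation)
KEEP = {'.', '?', '!', ','}  # sentence-ending marks and commas survive


def is_punc(token):
    """True for tokens made only of punctuation, except the ones in KEEP."""
    return token not in KEEP and all(ch in punc for ch in token)


tokens = ['the', 'build', 'fails', '!', '(', '...', "n't", ',', '--']
kept = [t for t in tokens if not is_punc(t)]
print(kept)
```

Tokens that mix letters and punctuation, such as `n't`, are kept because at least one character is not punctuation.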

Now we extract the GitHub data to build a vocabulary. The number of days to sample can be specified here; extraction starts with the issues from one week ago and then steps back two weeks at a time, taking one day from every two-week period. The data itself is not saved, only the word counts.

In [30]:
# here we download some data spaced out over about two years to build vocabulary
# only the word counts are kept; the tokenized issues are discarded after each day

total_data = 0

curr_day = datetime.datetime.today().date() - datetime.timedelta(days = 7)

num_days = 1

cnt = Counter()

while num_days < 50:
    df = get_data_for_day(curr_day)
    df = process_df(df)
    inp = df['title'].fillna(' ') + ' SEP ' + df['body'].fillna(' ')
    inp = inp.apply(preprocess)
    inp = inp[inp.apply(is_english)]
    inp = inp.apply(lambda x: x.lower())
    inp = inp.apply(word_tokenize).values
    inp = [[word for word in issue if not is_punc(word)] for issue in inp]
    # count every token of every issue seen on this day
    for issue in inp:
        cnt.update(issue)
    total_data += sum(df.memory_usage(deep = True))/1000000000
    # report progress every third day
    if (num_days + 1) % 3 == 0:
        print(f'{num_days} days and {round(total_data, 2)} GB looked at')
    curr_day -= datetime.timedelta(days = 14)
    num_days += 1
    
print(f'{len(cnt)} total words')
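
The sampling schedule can be seen in isolation with a short sketch. The helper `sample_days` and the fixed start date are hypothetical and only serve the illustration; the table names follow the `githubarchive.day.{date}` pattern used in `get_data_for_day`.

```python
import datetime


def sample_days(start, n_days, step_days=14):
    """Return n_days dates, stepping back step_days from start each time."""
    return [start - datetime.timedelta(days=step_days * i) for i in range(n_days)]


# Fixed start date so the example is reproducible (the notebook uses "a week ago").
days = sample_days(datetime.date(2021, 1, 15), 3)
tables = [f"githubarchive.day.{d.strftime('%Y%m%d')}" for d in days]
print(tables)
```

Each sampled day maps to one daily table, so every iteration of the loop issues one query.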


We keep only the most frequent words: going down the list from the most common word, we take words until they account for 80% of all word occurrences in the dataset we crawled.

In [18]:
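# `w` is the Word2Vec model created further down in this notebook,
# so this cell has to be run after that model exists
total_num = sum(cnt.values())
top_words = []
cutoff = 0.8*total_num
running = 0
for word, num in cnt.most_common():
    if running >= cutoff:
        break
    # keep only words the model does not already know
    if word not in w.wv:
        top_words.append(word)
    running += num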
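
The cutoff logic is easier to follow on a toy counter. This sketch uses a hypothetical helper, `words_covering`, and made-up counts; it leaves out the check against the existing model vocabulary so that it runs on its own.

```python
from collections import Counter


def words_covering(counter, fraction):
    """Most common words whose counts together reach `fraction` of all occurrences."""
    cutoff = fraction * sum(counter.values())
    running = 0
    selected = []
    for word, num in counter.most_common():
        if running >= cutoff:
            break
        selected.append(word)
        running += num
    return selected


toy = Counter({'error': 50, 'build': 25, 'kubernetes': 15, 'flaky': 10})
print(words_covering(toy, 0.8))
```

With 100 occurrences in total, the first two words cover 75, so a third word is needed to reach the 80% cutoff and the rarest word is left out.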

Now we download the pretrained model that was trained on Wikipedia and news text. To reduce the size of the model, we delete noisy words that are unlikely to come up, using the criteria in the cells below, and use PCA to reduce the vector size. The number of components is chosen from the singular values: we keep the smallest number of leading components whose singular values add up to at least 70% of the sum of all singular values.

In [74]:
# download pretrained english model

# urlretrieve lives in urllib.request in Python 3
import urllib.request

if not os.path.isfile('../models/wiki-news-300d-1M.vec'):
    urllib.request.urlretrieve("https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip",
                               "../models/wiki-news-300d-1M.vec.zip")
    with zipfile.ZipFile('../models/wiki-news-300d-1M.vec.zip', 'r') as zip_ref:
        zip_ref.extractall('../models')
    os.remove('../models/wiki-news-300d-1M.vec.zip')

In [108]:
model = KeyedVectors.load_word2vec_format('../models/wiki-news-300d-1M.vec')

In [194]:
print(len(model.vocab))
# remove words containing capital letters
pretrained_vocab = [v for v in model.vocab.keys() if v.lower() == v]
print(len(pretrained_vocab))
# remove hyphenated compounds (bigrams joined with '-')
pretrained_vocab = [v for v in pretrained_vocab if len(v.split('-')) < 2]
print(len(pretrained_vocab))
# remove words with nonlatin characters

import unicodedata as ud

# from stackexchange

latin_letters= {}

def is_latin(uchr):
    try: return latin_letters[uchr]
    except KeyError:
         return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
           for uchr in unistr
           if uchr.isalpha()) # isalpha suggested by John Machin

pretrained_vocab = [v for v in pretrained_vocab if only_roman_chars(v)]

print(len(pretrained_vocab))

# remove tokens that are mostly digits

digits = set('0123456789')

def is_mostly_numeric(string):
    n_digits = sum(1 for s in string if s in digits)
    return n_digits/len(string) > .5


pretrained_vocab = [v for v in pretrained_vocab if not is_mostly_numeric(v)]

print(len(pretrained_vocab))

999994
392610
296866
284422
244220
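
A few hand-picked tokens show how the last two filters behave. The sketch below restates them in compact form under the hypothetical names `only_latin_letters` and `mostly_numeric`; the sample tokens are made up. Note that a purely numeric token passes the Latin check, because it contains no letters to test, which is one reason a separate numeric filter is useful.

```python
import unicodedata as ud

DIGITS = set('0123456789')


def only_latin_letters(token):
    """True if every alphabetic character has 'LATIN' in its Unicode name."""
    return all('LATIN' in ud.name(ch) for ch in token if ch.isalpha())


def mostly_numeric(token):
    """True if more than half of the characters are ASCII digits."""
    return sum(ch in DIGITS for ch in token) / len(token) > 0.5


for token in ['café', 'москва', '2019', 'v2', 'x86']:
    print(token, only_latin_letters(token), mostly_numeric(token))
```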


In [195]:
# reduce size

from sklearn.decomposition import PCA

data = np.stack([model[v] for v in pretrained_vocab])

pca = PCA()

pca.fit(data)

total_singular_values = sum(pca.singular_values_)

thresh = 0.7
running = 0
n_comps = 0
while running < thresh*total_singular_values:
    running += pca.singular_values_[n_comps]
    n_comps += 1
    
print(f'{n_comps} components explain {thresh} of the variation')

final_pca = PCA(n_components = n_comps)
final_pca.fit(data)

175 components explain 0.7 of the variation


PCA(n_components=175)
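
The selection rule can be written without scikit-learn to make its behaviour clear. In this sketch the helper `components_for_share` and the singular values are hypothetical. It also shows that a threshold on the sum of singular values is not the same as a threshold on explained variance: in PCA the variance explained by a component is proportional to the square of its singular value, so the same 70% threshold generally selects fewer components when applied to variance.

```python
def components_for_share(values, share):
    """Smallest number of leading values whose sum reaches `share` of the total."""
    target = share * sum(values)
    running = 0.0
    for n, value in enumerate(values, start=1):
        running += value
        if running >= target:
            return n
    return len(values)


# Hypothetical singular values, sorted in descending order as PCA returns them.
singular_values = [10.0, 5.0, 3.0, 2.0]

by_singular_values = components_for_share(singular_values, 0.7)
by_variance = components_for_share([s ** 2 for s in singular_values], 0.7)
print(by_singular_values, by_variance)
```

Which criterion to use is a design choice; the notebook uses the sum of singular values.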

Now we initialize our Word2Vec model and build the vocabulary. The vocabulary combines the filtered words from the pretrained model, the new words discovered in the issue data, and an "unknown" token (`_unknown_`). We then copy the PCA-reduced pretrained vectors into the Word2Vec model.

In [162]:
w = Word2Vec(size=n_comps, window=5, min_count=1, workers=4)



In [163]:
top_words = ['openshift', 'kubernetes']

In [165]:
w.build_vocab(sentences = [pretrained_vocab + top_words + ['_unknown_']])

In [172]:
for v in tqdm(pretrained_vocab):
    w.wv[v] = final_pca.transform([model[v]])[0]


In [203]:
def in_set(word):
    if word in w.wv:
        return word
    else:
        return '_unknown_'
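
The effect of this mapping on a tokenized issue can be sketched with a plain set standing in for the model vocabulary. The vocabulary and the issue below are made up, and `map_to_vocab` is a hypothetical stand-in for `in_set`.

```python
# Stand-in for the words known to the Word2Vec model.
vocab = {'the', 'build', 'fails', 'on', '_unknown_'}


def map_to_vocab(word, vocab):
    """Return the word itself if known, otherwise the shared unknown token."""
    return word if word in vocab else '_unknown_'


issue = ['the', 'build', 'fails', 'on', 'openshift']
mapped = [map_to_vocab(word, vocab) for word in issue]
print(mapped)
```

All out-of-vocabulary words share a single vector, so the model never has to look up a word it was not built with.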

In [204]:
old_vec = w.wv['apple']

Finally, we train Word2Vec on 49 days of data (the loop runs while `num_days < 50`), reading in one day at a time. We start 10 days ago and step back two weeks with each iteration. The model is saved after each day.

In [None]:
curr_day = datetime.datetime.today().date() - datetime.timedelta(days = 10)

num_days = 1

while num_days < 50:
    df = get_data_for_day(curr_day)
    df = process_df(df).copy()
    df['proc'] = df['title'].fillna(' ') + ' SEP ' + df['body'].fillna(' ')
    df['proc'] = df['proc'].apply(preprocess)
    # .copy() avoids pandas' SettingWithCopyWarning on the filtered frame
    df = df[df['proc'].apply(is_english)].copy()
    df['proc'] = df['proc'].apply(lambda x: x.lower())
    inp = df['proc'].apply(word_tokenize).values
    # tokens outside the vocabulary are mapped to '_unknown_'
    inp = [[in_set(word) for word in issue if not is_punc(word)] for issue in inp]
    # train on this day's issues first, then save, so the file on disk includes them
    w.train(inp, total_examples = len(inp), epochs = 1)
    w.save('w2v.model')
    print(f'{num_days} days completed')
    curr_day -= datetime.timedelta(days = 14)
    num_days += 1