This notebook uses the issues opened on GitHub to build a Word2Vec model. The resulting W2V model can then be fed into a fastText model to classify GitHub issues. The notebook covers gathering the data, processing it in memory, and training the model on the loaded data. Because the data can be streamed into the model one day at a time, none of it has to be saved.

In [4]:
from google.cloud import bigquery
from google.oauth2 import service_account
import datetime
import numpy as np
import urllib
import zipfile
import os
from nltk.tokenize import word_tokenize
import nltk
from collections import Counter
from tqdm.notebook import tqdm
from gensim.models import Word2Vec, KeyedVectors
import unicodedata as ud
from sklearn.decomposition import PCA
from dotenv import find_dotenv, load_dotenv
import boto3
from w2v_preprocess import remove_quotes, is_bot, is_english, preprocess, is_punc

nltk.download('punkt')


[nltk_data] Downloading package punkt to /home/atersaak/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
load_dotenv(find_dotenv())

True

In [27]:
# whether to use ceph or store locally

use_ceph = True

if use_ceph:
    s3_endpoint_url = os.environ["OBJECT_STORAGE_ENDPOINT_URL"]
    s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
    s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
    s3_bucket = os.environ["OBJECT_STORAGE_BUCKET_NAME"]

    s3 = boto3.client(
        service_name="s3",
        aws_access_key_id=s3_access_key,
        aws_secret_access_key=s3_secret_key,
        endpoint_url=s3_endpoint_url,
    )

First, a Google Cloud project has to be created in order to use BigQuery to access the [GH Archive](https://www.gharchive.org/). The service account credentials can be stored in the root folder, and the project ID should match the one below.

In [2]:
# save key .json file in the github labeler root
# project id on bigquery account should match

credentials = service_account.Credentials.from_service_account_file(
    '../../github-issue-data-extraction-key.json')

project_id = 'github-issue-data-extraction'
client = bigquery.Client(credentials= credentials, project=project_id)

We define one function that takes a given day and returns a dataframe of the GitHub issues opened on that day, and a second function that applies some light processing to that dataframe.

In [4]:
def get_data_for_day(day):
    """
    Pass in a datetime object and a dataframe of all the issue data from that day will be returned
    """
    date = day.strftime('%Y%m%d')
    response = client.query(f"""SELECT JSON_EXTRACT(payload, '$.issue.title') as title,
                                JSON_EXTRACT(payload, '$.issue.body') as body,
                                JSON_EXTRACT(payload, '$.issue.html_url') as url,
                                JSON_EXTRACT(payload, '$.issue.user.login') as actor
                                FROM githubarchive.day.{date}
                                WHERE type = 'IssuesEvent' AND JSON_EXTRACT(payload, '$.action') = '"opened"'
                                """)
    df = response.to_dataframe()
    return df


def process_df(df):
    """
    Strip the surrounding JSON quotes from every column and drop the rows flagged by is_bot
    """
    for col in df.columns:
        df[col] = df[col].apply(remove_quotes)
        df = df[~df[col].apply(is_bot)]
    return df
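
To make the table naming in the query concrete, here is a minimal sketch of how a date maps to a daily GH Archive table, formatted the same way as in `get_data_for_day`. The helper name `table_for_day` is hypothetical and not part of the notebook. Note also that `JSON_EXTRACT` returns JSON-encoded values, so strings keep their double quotes; that is why the query compares against `'"opened"'` and why `remove_quotes` is applied afterwards.

```python
import datetime


def table_for_day(day):
    # Hypothetical helper: daily tables are named by date in YYYYMMDD form
    return f"githubarchive.day.{day.strftime('%Y%m%d')}"


example = table_for_day(datetime.date(2020, 1, 31))
print(example)
```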

Now we extract the GitHub data to build a vocabulary. The number of sampled days can be specified here: sampling starts with the issues opened a week ago and steps back two weeks at a time, taking one day per step. The data itself is not saved; only the word counts are kept, and each word is counted at most once per issue.

In [None]:
# here we download some data spaced out over about two years to build vocabulary


total_data = 0

curr_day = datetime.datetime.today().date() - datetime.timedelta(days = 7)

num_days = 1

cnt = Counter()

while num_days < 50:
    df = get_data_for_day(curr_day)
    df = process_df(df)
    inp = df['title'].fillna(' ') + ' SEP ' + df['body'].fillna(' ')
    inp = inp.apply(preprocess)
    inp = inp[inp.apply(is_english)]
    inp = inp.apply(lambda x: x.lower())
    inp = inp.apply(word_tokenize).values
    inp = [[word for word in issue if not is_punc(word)] for issue in inp]
    inp = [set(words) for words in inp]
    total_data += sum(df.memory_usage(deep = True))/1000000000
    if (num_days + 1) % 3 == 0:
        print(f'{num_days} days and {round(total_data, 2)} GB looked at')
    curr_day -= datetime.timedelta(days = 14)
    num_days += 1
    for d in inp:
        cnt.update(d)

print(f'{len(cnt)} total words')
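
The sampling schedule and the counting scheme used in the loop above can be sketched in isolation. This is only an illustration with made-up dates and tokens; the helper `sample_days` is hypothetical. It shows that stepping back 14 days at a time spreads the samples out, and that updating a `Counter` with a set counts each word at most once per issue (a document frequency rather than a raw term count).

```python
import datetime
from collections import Counter


def sample_days(start, n_samples, step_days=14):
    # Hypothetical helper: n_samples dates, each step_days earlier than the last
    return [start - datetime.timedelta(days=step_days * i) for i in range(n_samples)]


days = sample_days(datetime.date(2021, 1, 15), 3)

# Toy tokenized issues; "crash" appears twice in the first one
issues = [["crash", "on", "start", "crash"], ["crash", "fix"]]
doc_freq = Counter()
for tokens in issues:
    # A set removes duplicates, so each issue adds at most 1 per word
    doc_freq.update(set(tokens))
```

With the loop bounds used in the notebook (`num_days` running from 1 to 49), 49 days are sampled, which spans roughly 1.9 years.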

Now we download the pretrained fastText vectors that were trained on Wikipedia and news text. To reduce the size of the model, we delete noisy words that are unlikely to come up: words containing capital letters, hyphenated words, words with non-Latin letters, and words that are mostly digits. We then walk through the issue words from most to least frequent until 95% of all counted occurrences are covered, and add those that are not already in the pretrained vocabulary. Finally, we use PCA to reduce the vector size of the pretrained embeddings, keeping the smallest number of components whose singular values add up to at least 70% of the sum of all singular values.

In [10]:
# download pretrained english model

# urlretrieve lives in urllib.request on Python 3
import urllib.request

if not os.path.isfile('../models/wiki-news-300d-1M.vec'):
    urllib.request.urlretrieve("https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip",
                               "../models/wiki-news-300d-1M.vec.zip")
    with zipfile.ZipFile('../models/wiki-news-300d-1M.vec.zip', 'r') as zip_ref:
        zip_ref.extractall('../models')
    os.remove('../models/wiki-news-300d-1M.vec.zip')

In [11]:
model = KeyedVectors.load_word2vec_format('../models/wiki-news-300d-1M.vec')

In [12]:
print(len(model.vocab))
# remove words containing capital letters
pretrained_vocab = [v for v in model.vocab.keys() if v.lower() == v]
print(len(pretrained_vocab))
# remove hyphenated words (mostly multi-word compounds)
pretrained_vocab = [v for v in pretrained_vocab if len(v.split('-')) < 2]
print(len(pretrained_vocab))
# remove words with non-Latin letters
# (helper functions adapted from Stack Exchange)

latin_letters= {}


def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))


def only_roman_chars(unistr):
    return all(is_latin(uchr) for uchr in unistr if uchr.isalpha())


pretrained_vocab = [v for v in pretrained_vocab if only_roman_chars(v)]

print(len(pretrained_vocab))

digits = set('0123456789')


def is_mostly_numeric(string):
    # True when more than half of the characters are digits
    num_digits = sum(s in digits for s in string)
    return num_digits / len(string) > .5


pretrained_vocab = [v for v in pretrained_vocab if not is_mostly_numeric(v)]

print(len(pretrained_vocab))

999994
392610
296866
284422
244220
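
To see what each filter removes, the four criteria can be run on a handful of made-up words. This is a simplified sketch, not the notebook's exact code: it drops the lookup cache from `is_latin`, and the word list is invented for illustration.

```python
import unicodedata as ud


def only_roman_chars(word):
    # Every alphabetic character must have 'LATIN' in its Unicode name
    return all('LATIN' in ud.name(ch) for ch in word if ch.isalpha())


def is_mostly_numeric(word):
    # More than half of the characters are ASCII digits
    return sum(ch in '0123456789' for ch in word) / len(word) > 0.5


candidates = ["python", "Python", "state-of-the-art", "питон", "1234a", "utf8"]

kept = [w for w in candidates if w.lower() == w]       # drops "Python"
kept = [w for w in kept if len(w.split('-')) < 2]      # drops "state-of-the-art"
kept = [w for w in kept if only_roman_chars(w)]        # drops the Cyrillic word
kept = [w for w in kept if not is_mostly_numeric(w)]   # drops "1234a"
```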


In [19]:
vocab_set =set(pretrained_vocab)
total_num = sum([b for a, b in cnt.most_common()])
top_words = []
cutoff = 0.95*total_num
running = 0
for word, num in cnt.most_common():
    if running < cutoff:
        if word not in vocab_set:
            top_words.append(word)
        running += num
    else:
        break
del cnt
print(f'{len(top_words)} words added from random github issues')

142484 words added from random github issues
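
The 95% cutoff can be checked on a toy counter. The sketch below wraps the same logic in a hypothetical function `words_to_add`; the words and counts are invented. Note that known words still advance the running total, so the cutoff is measured against all counted occurrences, not only against the new words.

```python
from collections import Counter


def words_to_add(counts, known, coverage=0.95):
    # Walk words from most to least frequent until `coverage` of all
    # occurrences has been seen; collect those missing from `known`
    cutoff = coverage * sum(counts.values())
    running = 0
    added = []
    for word, num in counts.most_common():
        if running >= cutoff:
            break
        if word not in known:
            added.append(word)
        running += num
    return added


counts = Counter({"the": 50, "kubectl": 30, "segfault": 14, "xyzzy": 5, "qwop": 1})
new_words = words_to_add(counts, known={"the"})
```

Here the total is 100 occurrences, so the walk stops once 95 have been covered, and the rarest word `qwop` is never reached.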


In [20]:
# reduce size

data = np.stack([model[v] for v in pretrained_vocab])

pca = PCA()

pca.fit(data)

total_singular_values = sum(pca.singular_values_)

thresh = 0.7
running = 0
n_comps = 0
while running < thresh*total_singular_values:
    running += pca.singular_values_[n_comps]
    n_comps += 1

print(f'{n_comps} components explain {thresh} of the variation')

final_pca = PCA(n_components = n_comps)
final_pca.fit(data)

175 components explain 0.7 of the variation


PCA(n_components=175)
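
The component selection above can be reproduced with NumPy alone on random data, which makes the criterion easier to inspect. This is a sketch under assumptions: the toy matrix stands in for the embedding matrix, and `np.searchsorted` on the cumulative sum replaces the `while` loop. It also computes the share of variance, which is proportional to the squared singular values and is therefore a different (and here larger) number than the share of the singular values themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the embeddings: 200 "words", 20 dimensions of unequal spread
data = rng.normal(size=(200, 20)) * np.linspace(5.0, 0.5, 20)

# PCA works on centered data; its singular values come back sorted descending
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

thresh = 0.7
cumulative = np.cumsum(singular_values)
target = thresh * cumulative[-1]
# Smallest number of components whose singular values reach the target
n_comps = int(np.searchsorted(cumulative, target) + 1)

# Share of variance explained by those components (squared singular values)
variance_share = np.sum(singular_values[:n_comps] ** 2) / np.sum(singular_values ** 2)
print(n_comps, round(float(variance_share), 3))
```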

Now we initialize our Word2Vec model and build the vocabulary. The vocabulary joins the words kept from the pretrained model, the new words discovered in the issue data, and an "unknown" token.

In [21]:
w = Word2Vec(size=n_comps, window=5, min_count=1, workers=4)



In [22]:
w.build_vocab(sentences = [pretrained_vocab + top_words + ['_unknown_']])

In [23]:
for v in tqdm(pretrained_vocab):
    w.wv[v] = final_pca.transform([model[v]])[0]
del model

  0%|          | 0/244220 [00:00<?, ?it/s]

We save the model in Ceph. Since gensim may write the model as several files that share the `w2v.model` prefix, we upload and then delete every matching file.

In [51]:
w.save('../../models/w2v.model')

if use_ceph:
    for file in os.listdir('../../models/'):
        if 'w2v.model' in file:
            response = s3.upload_file(
                Bucket=s3_bucket,
                Key=f"github-labeler/w2v/{file}",
                Filename=f'../../models/{file}',
            )
            os.remove(f'../../models/{file}')
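
The file matching used for the upload can be tried out without touching Ceph. The sketch below creates a few empty files in a temporary directory and selects those that belong to the model. The helper `files_to_upload` is hypothetical, and the side-file name is only an illustrative assumption about what gensim might write next to the main file.

```python
import os
import tempfile


def files_to_upload(directory, prefix="w2v.model"):
    # Hypothetical helper: every file whose name contains the model prefix
    return sorted(f for f in os.listdir(directory) if prefix in f)


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative file names; "other.bin" should not be selected
    for name in ["w2v.model", "w2v.model.wv.vectors.npy", "other.bin"]:
        open(os.path.join(tmp, name), "w").close()
    matches = files_to_upload(tmp)
```

Because the check is a substring match, any unrelated file in the directory whose name happens to contain `w2v.model` would also be uploaded and deleted.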