# Text analysis of issue labels

First, import the necessary packages.

In [1]:
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer

# Stemmer instance reused in the Stemming section below.
ps = nltk.PorterStemmer()

### Read in the data
For this analysis, GitHub issue labels are not grouped by organization or repository. Instead, all 200+ issue labels are listed together as a single comma-separated string in the `issue_labels` column.

In [2]:
issue_data = pd.read_csv('data/whole_corpus.csv', header=0)
issue_data.issue_labels = issue_data.issue_labels.astype(str)
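
One thing to watch when casting with `astype(str)`: a missing value becomes the literal string `'nan'`, which would later be tokenized as if it were a label. The sketch below shows one way to guard against that by filling missing values first. The two-row CSV is made up for illustration and is not taken from `data/whole_corpus.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV: the second row has no labels at all.
csv_text = 'organization,issue_labels\norg_a,"bug,docs"\norg_b,\n'

demo = pd.read_csv(io.StringIO(csv_text), header=0)

# Replace missing values with an empty string before casting, so that
# a missing cell does not turn into the string 'nan'.
demo["issue_labels"] = demo["issue_labels"].fillna("").astype(str)

print(demo["issue_labels"].tolist())
```

Whether this guard is needed depends on the data: if every row has at least one label, the original cast is sufficient.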


In [3]:
issue_data.head()

Unnamed: 0,organization,issue_labels
0,whole_corpus,"breaking-change,datamanager,dependencies,eml-p..."


### Tokenization

Now, create and run the `tokenize` function. It splits the comma-separated string of issue labels in each row into a list of individual labels ("tokens"), which is much easier to analyze. The tokenized labels appear in the `text_tokenized` column.

In [4]:
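def tokenize(text):
    # Split the comma-separated label string into a list of labels.
    tokens = re.split(',', text)
    return tokens

# Lowercase each label string before splitting it.
issue_data['text_tokenized'] = issue_data['issue_labels'].apply(lambda x: tokenize(x.lower()))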

issue_data.head()

Unnamed: 0,organization,issue_labels,text_tokenized
0,whole_corpus,"breaking-change,datamanager,dependencies,eml-p...","[breaking-change, datamanager, dependencies, e..."
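
Splitting on commas alone keeps any whitespace around each label; the counts further down include an entry with a leading space (` docs - simple darwin cor`), which suggests this happens in the data. A minimal sketch of a stricter variant follows. The name `tokenize_clean` is hypothetical and not part of the original notebook.

```python
def tokenize_clean(text):
    """Split on commas, trim surrounding whitespace, and drop empty labels."""
    labels = (label.strip() for label in text.lower().split(","))
    return [label for label in labels if label]


# Example with stray spaces and an empty entry between two commas.
print(tokenize_clean("Bug, docs - simple ,,High Priority "))
```

Trimming matters for the later counting step, because `' docs'` and `'docs'` would otherwise be counted as two different labels.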


### Stemming

Use the `PorterStemmer` class from the Natural Language Toolkit (NLTK) package.

In [5]:
ps = nltk.PorterStemmer()

Here's the function that will 'stem' the labels. Stemming trims each label in our dataset down to a shortened or 'stemmed' version to facilitate term grouping.

In [6]:
def stemming(tokenized_text):
    # Stem each label. A multi-word label is passed to the stemmer as one string.
    text = [ps.stem(word) for word in tokenized_text]
    return text

issue_data['labels_stemmed'] = issue_data['text_tokenized'].apply(stemming)

issue_data.head()

Unnamed: 0,organization,issue_labels,text_tokenized,labels_stemmed
0,whole_corpus,"breaking-change,datamanager,dependencies,eml-p...","[breaking-change, datamanager, dependencies, e...","[breaking-chang, datamanag, depend, eml-pars, ..."
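
Because `stemming` hands each whole label to the stemmer, a multi-word label such as `high priority` is treated as a single word. A possible alternative, sketched below, stems each word inside a label separately and keeps the original separators. The helper name `stem_each_word` is hypothetical, and whether per-word stemming groups the labels better is an assumption to check against the data.

```python
import re

import nltk

ps = nltk.PorterStemmer()

# Runs of whitespace or hyphens are treated as separators between words.
SEPARATOR = re.compile(r"([\s\-]+)")


def stem_each_word(label):
    """Stem every word in a label independently, keeping separators as-is."""
    parts = SEPARATOR.split(label.strip())
    return "".join(
        part if not part or SEPARATOR.fullmatch(part) else ps.stem(part)
        for part in parts
    )


# Compare whole-label stemming with per-word stemming.
for label in ["priority", "high priority", "type-defect"]:
    print(label, "|", ps.stem(label), "|", stem_each_word(label))
```

With this variant the last word of `high priority` always receives the same stem as the standalone label `priority`, since both go through `ps.stem` as the same input.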


Next, count the stemmed labels. Note that some terms (such as variations on 'priority') end up under several different stems, so grouping them may require further manual inspection.

In [7]:
# Flatten the per-row lists into one Series, then count each stemmed label.
counts_of_labels = issue_data.labels_stemmed.explode().value_counts()
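
The manual grouping mentioned above could be sketched as a small lookup table that maps related stems to one group before summing their counts. The counts below are a subset copied from this notebook's output; the mapping `STEM_GROUPS` and the helper `group_counts` are hypothetical and would need to be built by inspecting the full list.

```python
from collections import Counter

# A few stem counts copied from the notebook output (not the full list).
stem_counts = Counter(
    {"high prior": 3, "low prior": 2, "prioriti": 2, "depend": 2, "typo": 2}
)

# Hypothetical hand-built mapping from a stem to a broader group name.
STEM_GROUPS = {
    "high prior": "priority",
    "low prior": "priority",
    "prioriti": "priority",
}


def group_counts(counts, groups):
    """Sum counts per group; stems without a mapping keep their own name."""
    grouped = Counter()
    for stem, n in counts.items():
        grouped[groups.get(stem, stem)] += n
    return grouped


grouped = group_counts(stem_counts, STEM_GROUPS)
print(grouped.most_common())
```

Keeping the mapping in a plain dictionary makes the grouping decisions easy to review and extend as more overlapping stems are found.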

In [8]:
pd.set_option('display.max_rows', None)
counts_of_labels

high prior                                3
new class                                 3
type-defect                               3
depend                                    2
mix                                       2
type-enhanc                               2
low prior                                 2
type-review                               2
typo                                      2
annot                                     2
hierarchi                                 2
 docs - simple darwin cor                 2
in progress                               2
usabl                                     2
prioriti                                  2
cmip6                                     1
planet microb                             1
question - document on dwc q&a            1
site-maintenan                            1
design feedback                           1
eo-qb                                     1
ugagre                               